CN113763962A - Audio processing method and device, storage medium and computer equipment


Info

Publication number: CN113763962A
Application number: CN202110504549.4A
Authority: CN (China)
Legal status: Pending
Prior art keywords: audio data, voiceprint information, sub, preset, voiceprint
Other languages: Chinese (zh)
Inventor: 曹爽 (Cao Shuang)
Assignee: Tencent Technology (Shenzhen) Co., Ltd.

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06393 Score-carding, benchmarking or key performance indicator [KPI] analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Abstract

The embodiment of the invention discloses an audio processing method and apparatus, a storage medium, and a computer device. Audio data in the current environment is acquired, the audio data containing at least one piece of voiceprint information; the audio data is divided into a plurality of sub-audio data based on differences between the voiceprint information in the audio data; the sub-audio data whose voiceprint information meets a preset condition is determined as target sub-audio data; and the target sub-audio data is scored according to a preset scoring rule to obtain a score corresponding to the audio data. The audio data is thus segmented by voiceprint recognition according to the differences between the voiceprint information it contains, the sub-audio data whose voiceprint information meets the preset condition is determined as the target sub-audio data, and the score of the target sub-audio data is taken as the score of the audio data. Interference of noise with the scoring system is thereby avoided, improving the accuracy of the audio processing and, in turn, of the scoring of the audio data.

Description

Audio processing method and device, storage medium and computer equipment
Technical Field
The invention relates to the field of audio processing technologies, and in particular to an audio processing method and apparatus, a storage medium, and a computer device.
Background
In recent years, with the development of internet technology and the popularization of intelligent terminals, online education has become one of the fastest-growing industries of the internet era. Online education breaks through the regional limitations of educational resources, gives students access to richer educational resources, and alleviates the uneven distribution of those resources.
In current online education products, especially products for children's English learning, when a child's learning voice is collected to score his or her learning, the scoring result is often inaccurate because the parent's read-along voice is collected as well.
Disclosure of Invention
The embodiments of the present application provide an audio processing method and apparatus, a storage medium, and a computer device, which can improve the accuracy of audio data processing and thereby the accuracy of audio data scoring.
A first aspect of the present application provides an audio processing method, including:
acquiring audio data in the current environment, wherein the audio data comprises at least one piece of voiceprint information;
segmenting the audio data into a plurality of sub-audio data based on a difference between voiceprint information in the audio data;
determining sub audio data of which the voiceprint information meets a preset condition as target sub audio data;
and scoring the target sub-audio data according to a preset scoring rule to obtain a score corresponding to the audio data.
Accordingly, a second aspect of the present application provides an audio processing apparatus comprising:
an acquisition unit, configured to acquire audio data in the current environment, wherein the audio data comprises at least one piece of voiceprint information;
a dividing unit configured to divide the audio data into a plurality of sub audio data based on a difference between voiceprint information in the audio data;
the determining unit is used for determining the sub-audio data of which the voiceprint information meets the preset condition as target sub-audio data;
and the scoring unit is used for scoring the target sub-audio data according to a preset scoring rule to obtain a score corresponding to the audio data.
In some embodiments, the determining unit includes:
the first determining subunit is used for determining a first number of sub-audio data of which the voiceprint information meets a preset condition;
and the splicing subunit is used for splicing the first number of sub-audio data according to a time sequence to obtain target sub-audio data.
In some embodiments, the determining unit includes:
the matching subunit is used for matching the voiceprint information of each piece of sub-audio data with the preset voiceprint information;
and the second determining subunit is used for determining the sub-audio data of which the voiceprint information is matched with the preset voiceprint information as the target sub-audio data.
In some embodiments, the apparatus further comprises:
an obtaining unit, configured to obtain a voiceprint information set of the current user, wherein the voiceprint information set comprises a plurality of pieces of voiceprint information and the acquisition time of each piece of voiceprint information;
and the prediction unit is used for predicting the voiceprint information corresponding to the current time according to the voiceprint information set and determining the voiceprint information corresponding to the current time as the preset voiceprint information.
In some embodiments, the prediction unit comprises:
the training subunit is used for training a preset voiceprint information prediction model by adopting a voiceprint information training sample, wherein the voiceprint information training sample comprises voiceprint information packets of a plurality of users, and the voiceprint information packets comprise voiceprint information acquired by the users at different times;
and the predicting subunit is used for predicting the voiceprint information corresponding to the current time of the current user based on the trained preset voiceprint information predicting model and the voiceprint information set.
In some embodiments, the determining unit includes:
an extraction subunit, configured to extract a tone feature in the voiceprint information corresponding to each piece of sub-audio data;
and the third determining subunit is used for determining the sub-audio data with the tone characteristic matched with the preset tone characteristic as the target sub-audio data.
In some embodiments, the third determining subunit includes:
the acquisition module is used for acquiring age section data corresponding to each tone feature;
and the determining module is used for determining the target sub-audio data corresponding to the preset age group data.
The third aspect of the present application further provides a computer-readable storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps of the audio processing method provided in the first aspect of the present application.
A fourth aspect of the present application provides a computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the audio processing method provided by the first aspect of the present application when executing the computer program.
A fifth aspect of the present application provides a computer program product or computer program comprising computer instructions stored in a storage medium. The processor of the computer device reads the computer instructions from the storage medium and executes them, causing the computer device to perform the steps of the audio processing method provided by the first aspect.
According to the audio processing method provided by the embodiment of the application, audio data in the current environment is collected, the audio data containing at least one piece of voiceprint information; the audio data is divided into a plurality of sub-audio data based on the differences between the voiceprint information in the audio data; the sub-audio data whose voiceprint information meets a preset condition is determined as target sub-audio data; and the target sub-audio data is scored according to a preset scoring rule to obtain a score corresponding to the audio data. The audio data is thus segmented according to the differences between the voiceprint information it contains, the sub-audio data whose voiceprint information meets the preset condition is determined as the target sub-audio data, and the score of the target sub-audio data is taken as the score of the audio data. Interference of noise with the scoring system is thereby avoided, improving the accuracy of the audio processing and, in turn, of the scoring of the audio data.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the following drawings show only some embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a schematic diagram of a scenario of audio processing provided herein;
FIG. 2 is a schematic flow chart of an audio processing method provided herein;
FIG. 3 is another schematic flow chart diagram of an audio processing method provided by the present application;
FIG. 4 is a schematic diagram of an audio processing apparatus provided in the present application;
FIG. 5 is a schematic view of a display scene of a visualization terminal;
FIG. 6A is a schematic diagram of a parental mode login page;
FIG. 6B is another schematic diagram of a parental mode login page;
fig. 7 is a schematic structural diagram of a computer device provided in the present application.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides an audio processing method and apparatus, a computer-readable storage medium, and a computer device. The audio processing method can be used in an audio processing apparatus, which may be integrated in a computer device; the computer device may be a terminal or a server. The terminal can be a mobile phone, a tablet computer, a notebook computer, a smart television, a wearable smart device, a personal computer (PC), and the like. The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN), big data, and artificial intelligence platforms.
Please refer to fig. 1, a schematic view of an audio processing scenario provided in the present application. As shown, the computer device captures audio data in the current environment, which may contain several pieces of voiceprint information. After the audio data in the current environment is obtained, the voiceprint information it contains is extracted; the audio data is then divided into a plurality of sub-audio data according to the differences between the voiceprint information, the sub-audio data whose voiceprint information meets the preset condition is determined as the target sub-audio data, and the target sub-audio data is scored according to the preset scoring rule to obtain the score corresponding to the audio data.
It should be noted that the scene schematic diagram of audio processing shown in fig. 1 is only an example, and the audio processing scene described in the embodiment of the present application is for more clearly illustrating the technical solution of the present application, and does not constitute a limitation on the technical solution provided by the present application. As will be appreciated by those skilled in the art, with the evolution of audio processing and the emergence of new service scenarios, the technical solutions provided in the present application are also applicable to similar technical problems.
Based on the above-described implementation scenarios, detailed descriptions will be given below.
Embodiments of the present application will be described from the perspective of an audio processing apparatus, which may be integrated in a computer device. The computer device may be a terminal or a server. As shown in fig. 2, a schematic flow chart of an audio processing method provided in the present application, the method includes:
Step 101, collecting audio data in the current environment.
The audio processing method provided by the application can be applied to an online education application program, which can be used for online teaching of various courses, including but not limited to language, science, literature, and art courses. In some cases, the online education application may assess a student's learning so that the student can track his or her progress in real time. For example, in a language course, specifically an English teaching course, the application may collect audio data of a student reading an English text aloud and then score the collected audio data, and the student can learn from the score how well he or she has mastered that part of the material. However, some students, such as young children, cannot complete the learning tasks independently because of their limited ability and generally need a parent to lead the reading. When the application collects the audio data of the student reading, it therefore also captures the parent's lead-reading voice, and scoring that voice produces inaccurate score data that cannot truly reflect the student's learning.
Therefore, the present application provides an audio processing method to avoid the influence of the lead-reading voice on the scoring result and improve the accuracy of scoring the audio data. Specifically, the method determines the target sub-audio data whose voiceprint information in the audio data meets a preset condition and then scores the target sub-audio data to obtain the score corresponding to the audio data. The embodiments of the present application are described in detail below.
When the application is in the teaching mode, it collects the audio data in the current environment in real time. The audio data may contain only one piece of voiceprint information or several. When the audio data contains several pieces of voiceprint information, the voiceprint information may be divided, specifically classified into different categories. For example, the voiceprint information contained in the audio data may be divided into male and female voiceprint information, or into child, teenager, and adult voiceprint information. A more detailed division may be made by combining several constraints, for example into male-child and female-child voiceprint information. Voiceprint information of the same class can be divided further, according to its details, into several individual voiceprints. For example, when the voiceprints of a father, mother, son, and daughter are detected in the collected audio data, the voiceprint information can first be divided into an adult voiceprint class (the father's and mother's voiceprints) and a child voiceprint class (the son's and daughter's voiceprints). The child voiceprint class can then be examined further to distinguish which voiceprint corresponds to the son and which to the daughter.
Step 102, the audio data is divided into a plurality of sub audio data based on the difference between the voiceprint information in the audio data.
When several pieces of voiceprint information are detected in the audio data, the voiceprint information is extracted from the audio data to obtain those pieces. The sub-audio data corresponding to each piece of voiceprint information is then determined one by one from the audio data, so that the audio data is divided into a plurality of sub-audio data according to the differences between the voiceprint information.
The sub-audio data may or may not intersect in the time dimension. For example, suppose the total duration of the audio data is T and time nodes t1 and t2 exist between 0 and T, with t2 greater than t1, so that the total duration is divided into three periods: 0 to t1, t1 to t2, and t2 to T. During the period 0 to t1 there may be one piece of sub-audio data corresponding to a given voiceprint, or several. The sub-audio data corresponding to one piece of voiceprint information may exist in only one period or in several periods, and when it exists in several periods it may be continuous or discontinuous.
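For illustration, the segmentation in step 102 could be implemented along the following lines. This is a minimal sketch under assumptions the patent does not state: the speaker-embedding function embed_window, the one-second window, and the similarity threshold are placeholders introduced for the example.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two voiceprint embeddings.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def split_by_voiceprint(audio, sr, embed_window, win_s=1.0, threshold=0.75):
    """Cut the audio into sub-audio data wherever the voiceprint of
    adjacent windows differs, i.e. at assumed speaker-change points."""
    win = int(win_s * sr)
    frames = [audio[i:i + win] for i in range(0, len(audio) - win + 1, win)]
    embs = [embed_window(f) for f in frames]
    segments, start = [], 0
    for i in range(1, len(embs)):
        if cosine_sim(embs[i - 1], embs[i]) < threshold:  # speaker change
            segments.append(audio[start * win:i * win])
            start = i
    segments.append(audio[start * win:])  # trailing segment
    return segments
```

Non-overlapping windows keep the sketch short; a production system would likely use overlapping windows and a trained diarization model.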
And 103, determining the sub-audio data of which the voiceprint information meets the preset condition as target sub-audio data.
Since the sub-audio data are obtained by dividing the audio data according to the differences between the voiceprint information, the target sub-audio data can be determined by screening the sub-audio data, and that screening can likewise be performed along the dimension of the voiceprint information.
To determine the sub-audio data whose voiceprint information meets the preset condition, the voiceprint information of each piece of sub-audio data may be matched against preset voiceprint information, and the sub-audio data whose matching similarity is greater than a preset threshold is determined as the target sub-audio data. Each piece of sub-audio data may also be classified according to the type of its voiceprint information: as described above, the voiceprint information may be divided into male and female voiceprint information, so the sub-audio data can likewise be divided into sub-audio data corresponding to male voiceprints and sub-audio data corresponding to female voiceprints. The sub-audio data meeting the preset condition is then determined from the resulting categories.
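A sketch of the matching just described, assuming each sub-audio clip can be mapped to an embedding by some function embed (an assumption, not part of the patent) and compared with the preset voiceprint by cosine similarity against a preset threshold:

```python
import numpy as np

def select_target_segments(segments, embed, preset_voiceprint, threshold=0.8):
    """Keep the sub-audio data whose voiceprint matches the preset
    voiceprint, i.e. whose similarity exceeds the preset threshold."""
    def sim(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    return [seg for seg in segments if sim(embed(seg), preset_voiceprint) > threshold]
```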
In some embodiments, determining the sub-audio data with the voiceprint information satisfying the preset condition as the target sub-audio data includes:
1. matching the voiceprint information of each sub-audio data with preset voiceprint information;
2. and determining the sub-audio data of which the voiceprint information is matched with the preset voiceprint information as target sub-audio data.
In the embodiment of the application, the target sub-audio data is determined by matching the voiceprint information of each piece of sub-audio data with preset voiceprint information. Before the application is used to collect and score audio data, the preset voiceprint information needs to be set on a settings interface of the application. Specifically, the user can tap voiceprint collection on the settings interface and then record audio containing a single voiceprint in a quiet environment. For example, the recording interface prompts the user to read, or read along with, individual words or sentences; the application collects that speech, extracts the voiceprint information from the collected audio data, and stores it as the preset voiceprint information. In some embodiments, the application may extract voiceprint information several times to collect more detailed voiceprint features before finally determining and storing the preset voiceprint information. Voiceprint information is composed of several features, including but not limited to tone, timbre, and the like. If the application derived the voiceprint features from a single reading, random factors would weigh heavily on the collected features, which might then be insufficiently accurate. The application can therefore have the user read a sentence or passage repeatedly and extract the voiceprint features from the several resulting recordings. For example, three segments of audio data may be collected, a tone feature extracted from each, and the common part of the three extracted tone features kept as the tone feature of the preset voiceprint information. More accurate preset voiceprint information can be obtained in this way.
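The enrollment step could look like the following sketch. Averaging the embeddings of several prompted recordings is one plausible reading of keeping the "common part" of the repeatedly extracted features, not the patent's prescribed method; the embed function is again an assumed embedding extractor.

```python
import numpy as np

def enroll_preset_voiceprint(recordings, embed):
    """Derive the preset voiceprint from several prompted readings.

    Averaging over repeated recordings damps the random, one-off
    variation that a single reading would carry."""
    return np.stack([embed(r) for r in recordings]).mean(axis=0)
```

A typical call would be preset = enroll_preset_voiceprint([take1, take2, take3], embed), with each take recorded in a quiet environment as described above.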
After the preset voiceprint information is set, the application program acquires the audio data under the current environment and extracts the target sub-audio data matched with the preset voiceprint information from the audio data.
In some embodiments, before matching the voiceprint information of each sub-audio data with the preset voiceprint information, the method further includes:
1.1, acquiring a voiceprint information set of a current user, wherein the voiceprint information set comprises a plurality of voiceprint information and acquisition time of each voiceprint information;
and 1.2, predicting the voiceprint information corresponding to the current time according to the voiceprint information set, and determining the voiceprint information corresponding to the current time as preset voiceprint information.
In the embodiment of the application, the user does not have to set the preset voiceprint information in the application manually: the application can predict the voiceprint information at the current time from the historical voiceprint information and determine the predicted voiceprint information as the preset voiceprint information.
Specifically, each time the user enters the teaching mode, the application may determine the target user from the user account used to log in. Login can be by account and password, by mobile phone number and verification code, through a third-party application (such as a messaging application), or by voice. For voice login, a login interface for the target user can be displayed in the application client, and input voice information is received in response to a touch operation on a voice-login control in that interface. Speech recognition is then performed on the input voice, the recognized content is matched against preset speech content, and the target user is allowed to log in to the application when the two match; otherwise login is refused. This can be understood as setting a spoken password by which the target user logs in. Alternatively, in some embodiments, the voiceprint information of the received voice can be extracted and matched against preset voiceprint information, with login allowed only when they match. This can be understood as setting a voiceprint password: the user logs in by speaking the voice corresponding to the voiceprint password.
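The voiceprint-password variant of the login check reduces to a single comparison, sketched below with the same assumed embed function and an illustrative threshold:

```python
import numpy as np

def voiceprint_login(voice_clip, embed, preset_voiceprint, threshold=0.8):
    """Allow login only when the voiceprint of the spoken phrase
    matches the stored preset voiceprint."""
    e = embed(voice_clip)
    sim = float(np.dot(e, preset_voiceprint) /
                (np.linalg.norm(e) * np.linalg.norm(preset_voiceprint) + 1e-9))
    return sim > threshold
```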
After the user logs in, the application determines, from the logged-in account, that the sub-audio data to be extracted from the audio data should be the sub-audio data matching the target user's voiceprint information, and that this is the sub-audio data to be scored. The application then retrieves the target user's voiceprint information from the historical voiceprint library; there are several pieces of such voiceprint information, each with a corresponding acquisition time.
Because the target user's voiceprint changes as the user ages, the application can learn this pattern of change in advance, predict the target user's voiceprint at the current time from the historical voiceprint information of different periods, and determine the predicted voiceprint as the preset voiceprint information. The voiceprint information collected at different times can first be assembled into a voiceprint information set of data pairs, each pair consisting of a timestamp and the corresponding voiceprint information: a first pair of timestamp 1 and voiceprint information 1, a second pair of timestamp 2 and voiceprint information 2, a third pair of timestamp 3 and voiceprint information 3, and so on. How the user's voiceprint information changes over time can then be analyzed from the data pairs in the set, and the voiceprint information corresponding to the current time predicted from that pattern. Each piece of voiceprint information in the set can also be divided along several feature dimensions; the change of each voiceprint feature over time is analyzed, the feature data corresponding to each voiceprint feature at the current time is predicted, and the predicted feature data are finally combined to obtain the preset voiceprint information.
In the embodiment of the application, the user is thus not required to set the preset voiceprint information manually, which improves audio processing efficiency and the user's experience. Even when the user returns to the application after a long absence, the preset voiceprint information can still be determined automatically without being reset, the target sub-audio data matching the user's voiceprint can be extracted accurately from the voice data, and the processing efficiency of the audio data is greatly improved.
In some embodiments, predicting voiceprint information corresponding to the current time according to the voiceprint information set includes:
1.1.1, training a preset voiceprint information prediction model by adopting a voiceprint information training sample, wherein the voiceprint information training sample comprises voiceprint information packets of a plurality of users, and the voiceprint information packets comprise voiceprint information acquired by the users at different times;
and 1.1.2, predicting the voiceprint information corresponding to the current time of the current user based on the trained preset voiceprint information prediction model and the voiceprint information set.
In this embodiment, the user's voiceprint information at the current time can be predicted by a voiceprint prediction model. For example, a user's voiceprint information packet may first be obtained, containing the voiceprint information acquired from the user at different points in time. The acquisition times serve as the model's input data and the voiceprint information acquired at each time as its output data, and the voiceprint prediction model is trained on these pairs. Once trained, the model outputs a predicted voiceprint for any time point given as input; feeding in the current time yields the voiceprint information corresponding to the current time. The predicted voiceprint information can then be set as the preset voiceprint information, the voiceprint information of each piece of sub-audio data matched against it, and the matching sub-audio data determined as the target sub-audio data.
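The patent does not specify the form of the prediction model. As a minimal stand-in, a per-dimension linear trend fitted to the timestamped voiceprints already captures "voiceprint as a function of time"; a real system might use a richer regressor.

```python
import numpy as np

def train_voiceprint_predictor(timestamps, voiceprints):
    """Fit voiceprint(t) ~ slope * t + intercept per embedding dimension.

    timestamps: shape (n,), acquisition times (e.g. Unix seconds)
    voiceprints: shape (n, d), the voiceprint acquired at each time
    """
    t = np.asarray(timestamps, dtype=float)
    V = np.asarray(voiceprints, dtype=float)
    A = np.stack([t, np.ones_like(t)], axis=1)       # (n, 2) design matrix
    coeffs, *_ = np.linalg.lstsq(A, V, rcond=None)   # (2, d): slope, intercept
    return lambda now: coeffs[0] * now + coeffs[1]   # predicted voiceprint at 'now'
```

With predict = train_voiceprint_predictor(times, prints), evaluating predict at the current time yields the preset voiceprint for this session.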
In some embodiments, determining the sub-audio data whose voiceprint information satisfies the preset condition as the target sub-audio data includes:
A. extracting tone features in the voiceprint information corresponding to each sub-audio data;
B. and determining the sub-audio data with the tone characteristics matched with the preset tone characteristics as target sub-audio data.
When children learn English as in the foregoing example, a parent is often needed to lead the reading because the child cannot complete it independently or follow the in-application AI reading alone, so the parent's voice is mixed into the collected audio data, which may bias the application's scoring of the child's pronunciation. The child's audio data therefore needs to be extracted from the acquired audio data and the parent's audio data screened out. Because the tone features of children and adults differ markedly, the present application distinguishes the sub-audio data of the child from that of the parent by the tone feature in the voiceprint information.
Specifically, voiceprint information conforming to a child's tone can be set as the preset voiceprint information, either by the user or by the application itself. The application extracts from the acquired audio data the sub-audio data whose voiceprint information conforms to the child tone as the target sub-audio data, and then scores the target sub-audio data.
In this embodiment, the voiceprint information of each individual child does not need to be collected as preset voiceprint information; it suffices to judge whether the voiceprint information conforms to the tone features of a child. The approach therefore suits most children and has good compatibility.
In some embodiments, determining the sub-audio data whose pitch characteristic matches the preset pitch characteristic as the target sub-audio data includes:
a. acquiring age section data corresponding to each tone feature;
b. target sub-audio data corresponding to the preset age group data is determined.
In some embodiments, more specifically, after the tone feature corresponding to each piece of voiceprint data is obtained, the age group corresponding to the tone feature may be determined, for example 0 to 3 years old, 3 to 8 years old, or 8 to 18 years old. The target sub-audio data matching the preset age group is then determined from the age-group data corresponding to the tone feature of each piece of voiceprint data. The application can thus automatically acquire the target sub-audio data for users of different ages and exclude the interference of other audio data.
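As a sketch, the tone feature can be reduced to an average fundamental frequency and bucketed into age groups. The frequency boundaries below are illustrative assumptions; the patent only states that tone features map to age-group data such as 0-3, 3-8, and 8-18 years, and mean_f0 stands in for some pitch estimator.

```python
def age_group_from_pitch(mean_f0_hz):
    # Illustrative F0 brackets; real boundaries would be learned from data.
    if mean_f0_hz > 350:
        return "0-3"
    if mean_f0_hz > 250:
        return "3-8"
    if mean_f0_hz > 180:
        return "8-18"
    return "adult"

def select_by_age(segments, mean_f0, preset_group="3-8"):
    """Keep the sub-audio data whose estimated age group matches the
    preset age-group data."""
    return [s for s in segments if age_group_from_pitch(mean_f0(s)) == preset_group]
```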
And 104, scoring the target sub-audio data according to a preset scoring rule to obtain a score corresponding to the audio data.
After the target sub-audio data with the voiceprint information meeting the preset conditions is determined, the application program scores the target sub-audio data according to a preset scoring rule to obtain a score corresponding to the target sub-audio data, and the score is the score corresponding to the audio data.
In some embodiments, determining the sub-audio data with the voiceprint information satisfying the preset condition as the target sub-audio data includes:
1. determining a first number of sub-audio data of which the voiceprint information meets a preset condition;
2. and splicing the first number of sub-audio data according to the time sequence to obtain the target sub-audio data.
A child may read part of a passage and then stop because of forgetting or weak mastery, at which point the parent has to remind the child or lead the reading again before the child continues. The child's audio is thereby interrupted, producing several fragmented sub-audio clips: that is, the target sub-audio data whose voiceprint information satisfies the preset condition may consist of several pieces of sub-audio data. Scoring any one of them alone would lower the score because the reading is incomplete, making the scoring result inaccurate.
Therefore, in the embodiment of the present application, when there are several pieces of sub-audio data whose voiceprint information satisfies the preset condition, they may be obtained separately and then spliced in chronological order into one complete piece of audio serving as the target sub-audio data. This prevents incomplete audio from affecting the scoring and makes the scoring result more accurate.
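The splicing itself is a chronological concatenation, sketched here under the assumption that each matched clip is carried together with its start time:

```python
import numpy as np

def splice_in_time_order(matched_clips):
    """Concatenate matched sub-audio clips into one target clip.

    matched_clips: list of (start_time, samples) pairs for the
    sub-audio data whose voiceprint met the preset condition."""
    ordered = sorted(matched_clips, key=lambda p: p[0])  # chronological order
    return np.concatenate([samples for _, samples in ordered])
```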
As described above, the audio processing method provided in the embodiment of the present application acquires audio data in the current environment, the audio data containing at least one piece of voiceprint information; divides the audio data into a plurality of sub-audio data based on the differences between the voiceprint information in the audio data; determines the sub-audio data whose voiceprint information meets a preset condition as the target sub-audio data; and scores the target sub-audio data according to a preset scoring rule to obtain the score corresponding to the audio data. The audio data is thus segmented according to the differences between the voiceprint information it contains, the sub-audio data whose voiceprint information meets the preset condition is determined as the target sub-audio data, and the score of the target sub-audio data is taken as the score of the audio data. Interference of noise with the scoring system is thereby avoided, improving the accuracy both of the audio processing and of the scoring of the audio data.
Accordingly, the embodiment of the present application will further describe in detail the audio processing method provided by the present application from the perspective of a computer device, where the computer device may be a terminal or a server. As shown in fig. 3, another schematic flow chart of the audio processing method provided by the present application is shown, where the method includes:
Step 201, in response to a voice instruction to enter an online education application, the computer device opens the online education application.
The computer device may be a terminal or a server. When it is a terminal, it may be a mobile phone, a tablet computer, a notebook computer, a smart television, a wearable smart device such as a smart watch, a personal computer (PC), or a touch-and-talk pen. In this embodiment in particular, the terminal may be a touch-and-talk pen. Once started, the pen collects voice information in real time, can connect to a server background, and can process the collected voice information with speech technology. The key technologies of speech technology are automatic speech recognition (ASR), text-to-speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and voice is expected to become one of its most promising modes. When the user tells the pen to open the English learning application, the pen has the application opened in the server background, and the user can begin learning. The touch-and-talk pen here is only an example; the terminal may be any of the others listed above. The English learning application is likewise only an example and can be any other online education application.
In the scenario of children learning English, using the speech-recognition-equipped touch-and-talk pen provided by this application as the terminal both makes online learning possible for users with visual impairment and reduces the time children spend facing a display screen, protecting their eyesight.
Step 202, the computer device collects audio data in the current environment.
After the server background enters the English learning application, the touch-and-talk pen continues to receive the voice information input by the user. When the user says "listening practice" to the pen, the server background enters a listening playback mode and the pen starts to play English listening material. When the user says "English learning" to the pen, the teaching mode is entered, and the pen collects audio data in the current environment in real time. After collecting the audio data, the pen detects the voiceprint information in it.
In some cases, when the acquired audio data in the current environment is detected to contain only one piece of voiceprint information, the touch-and-talk pen scores the audio data directly and stores the scoring result. When the collected audio data is detected to contain several pieces of voiceprint information, the audio data to be scored needs to be determined further.
Step 203, the computer device divides the audio data into a plurality of sub-audio data based on the differences between the voiceprint information in the audio data.
When several pieces of voiceprint information are detected in the audio data, the sub-audio data corresponding to each piece of voiceprint information is extracted from the audio data according to the differences between the voiceprints, yielding a plurality of sub-audio data. In some cases one piece of voiceprint information corresponds to one piece of sub-audio data; in others it may correspond to several. For example, when a child is reading along in English and the sentence is long, the child may forget the words in the second half, and the parent may remind the child or lead the reading. The parent's voice then appears after the child's voice, and the child's voice appears again afterwards. The audio data in this example thus consists of three pieces of sub-audio data but contains only two kinds of voiceprint information, the child's voiceprint corresponding to two of the pieces.
Step 204, the computer device determines a first number of sub-audio data whose voiceprint information meets a preset condition.
When the audio data contains several pieces of voiceprint information and has been divided into a plurality of sub-audio data accordingly, the sub-audio data whose voiceprint information meets the preset condition must be determined further. There may be one such piece of sub-audio data or several. When there is one, it is determined as the target sub-audio data directly; when there are several, the target sub-audio data is determined further from them. If the audio data is divided into M pieces of sub-audio data according to the voiceprint information and N of them have voiceprint information meeting the preset condition, M is greater than N.
Determining whether the voiceprint information satisfies the preset condition may be done by matching the voiceprint information of each piece of sub-audio data against preset voiceprint information: when the similarity between them is greater than a preset threshold, the voiceprint information of that sub-audio data is determined to satisfy the preset condition. Alternatively, features of a preset dimension may be extracted from the voiceprint information of each piece of sub-audio data, for example its tone feature, and matched against a preset tone feature; when the tone feature in the sub-audio data's voiceprint information matches the preset tone feature, the voiceprint information is determined to satisfy the preset condition.
Step 205, the computer device splices the first number of sub-audio data to obtain the target sub-audio data.
When one piece of voiceprint data corresponds to several pieces of sub-audio data, those pieces may not be continuous. Scoring of audio data is generally performed by comparing the similarity between the collected audio data and preset audio data and determining the score from that similarity, so if only one of the sub-audio pieces were scored, the score would suffer from the missing data. For example, when a child learns "I want to make some friends", the reading may be interrupted by the child's limited ability, with the parent reminding or leading partway through. The sentence may end up consisting of three pieces of sub-audio data: "I want to", "make", and "some friends". Scoring any single one of them by computing its similarity with "I want to make some friends" would lower the score because of the missing data.
To solve the above problem, the present embodiment proposes the following solution: after the several pieces of sub-audio data whose voiceprint information satisfies the preset condition are determined, they are spliced according to the chronological order in which each was collected in the audio data, and the spliced audio data is taken as the target sub-audio data.
Step 206, the computer device scores the target sub-audio data according to a preset scoring rule to obtain a score corresponding to the audio data.
After the target sub-audio data is determined, the touch-and-talk pen may compute the similarity between the target sub-audio data and preset comparison data and determine the score data corresponding to the audio data from the computed similarity.
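A minimal scoring rule consistent with this description compares fixed-length features of the spliced target audio with those of the preset reference and maps the similarity onto a 0-100 scale. The feature choice (e.g. averaged MFCCs) and the linear mapping are assumptions made for the sketch; the patent only requires "a preset scoring rule" based on similarity.

```python
import numpy as np

def score_against_reference(target_feats, reference_feats):
    """Score the target sub-audio data against preset comparison data.

    Both arguments are fixed-length feature vectors; cosine similarity
    is mapped to an integer score between 0 and 100."""
    sim = float(np.dot(target_feats, reference_feats) /
                (np.linalg.norm(target_feats) * np.linalg.norm(reference_feats) + 1e-9))
    return round(max(0.0, sim) * 100)
```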
Step 207, the computer device broadcasts the score corresponding to the audio data.
After computing the score of the audio data, the touch-and-talk pen stores the score corresponding to the audio data. The pen can also broadcast the score by voice, which makes it easy for students with visual impairment to learn how well they have done.
Further, in some embodiments, the touch-and-talk pen may also obtain the user's historical results, for example the score obtained the last time the same sentence was learned. The previous score is compared with the current one and an evaluation voice is generated accordingly: if the current score is higher, an encouraging voice is played; if it is lower, a voice reminding the user to keep working hard is played, and so on. When the pen broadcasts the score it can thus also broadcast the evaluation voice, further improving the user experience.
According to the above description, in the audio processing method provided by the present application, the computer device opens the online education application in response to a voice instruction to enter it; collects audio data in the current environment; divides the audio data into a plurality of sub-audio data based on the differences between the voiceprint information in the audio data; determines a first number of sub-audio data whose voiceprint information satisfies a preset condition; splices the first number of sub-audio data to obtain the target sub-audio data; scores the target sub-audio data according to a preset scoring rule to obtain the score corresponding to the audio data; and broadcasts that score. The audio data is thus segmented according to the differences between the voiceprint information it contains, the sub-audio data whose voiceprint information meets the preset condition is determined as the target sub-audio data, and the score of the target sub-audio data is taken as the score of the audio data. Interference of noise with the scoring system is thereby avoided, improving the accuracy both of the audio processing and of the scoring of the audio data.
In order to better implement the above method, the embodiment of the present invention further provides an audio processing apparatus, which may be integrated in a server.
For example, as shown in fig. 4, for a schematic structural diagram of an audio processing apparatus provided in an embodiment of the present application, the audio processing apparatus may include a collecting unit 301, a dividing unit 302, a determining unit 303, and a scoring unit 304, as follows:
the acquisition unit 301 is configured to acquire audio data in a current environment, where the audio data includes at least one piece of voiceprint information;
a dividing unit 302 for dividing the audio data into a plurality of sub-audio data based on a difference between voiceprint information in the audio data;
a determining unit 303, configured to determine that sub-audio data with voiceprint information meeting a preset condition is target sub-audio data;
and the scoring unit 304 is configured to score the target sub-audio data according to a preset scoring rule, so as to obtain a score corresponding to the audio data.
In some embodiments, the determining unit comprises:
the first determining subunit is used for determining a first number of sub-audio data of which the voiceprint information meets a preset condition;
and the splicing subunit is used for splicing the first number of sub-audio data according to the time sequence to obtain the target sub-audio data.
In some embodiments, the determining unit comprises:
the matching subunit is used for matching the voiceprint information of each piece of sub-audio data with the preset voiceprint information;
and the second determining subunit is used for determining the sub-audio data of which the voiceprint information is matched with the preset voiceprint information as the target sub-audio data.
In some embodiments, the apparatus further comprises:
an obtaining unit, configured to obtain a voiceprint information set of the current user, wherein the voiceprint information set comprises a plurality of pieces of voiceprint information and the acquisition time of each piece of voiceprint information;
and the prediction unit is used for predicting the voiceprint information corresponding to the current time according to the voiceprint information set and determining the voiceprint information corresponding to the current time as the preset voiceprint information.
In some embodiments, a prediction unit, comprises:
the training subunit is used for training a preset voiceprint information prediction model by adopting a voiceprint information training sample, wherein the voiceprint information training sample comprises voiceprint information packets of a plurality of users, and the voiceprint information packets comprise voiceprint information acquired by the users at different times;
and the predicting subunit is used for predicting the voiceprint information corresponding to the current time of the current user based on the trained preset voiceprint information predicting model and the voiceprint information set.
In some embodiments, the determining unit comprises:
an extraction subunit, configured to extract a tone feature in the voiceprint information corresponding to each piece of sub-audio data;
and the third determining subunit is used for determining the sub-audio data with the tone characteristic matched with the preset tone characteristic as the target sub-audio data.
In some embodiments, the third determining subunit includes:
the acquisition module is used for acquiring age section data corresponding to each tone feature;
and the determining module is used for determining the target sub-audio data corresponding to the preset age group data.
In a specific implementation, the above units may be implemented as independent entities, or may be combined arbitrarily to be implemented as the same or several entities, and the specific implementation of the above units may refer to the foregoing method embodiments, which are not described herein again.
As can be seen from the above, in the audio processing apparatus provided in this embodiment, the acquisition unit 301 acquires audio data in the current environment, where the audio data includes at least one piece of voiceprint information; the division unit 302 divides the audio data into a plurality of sub-audio data based on a difference between the voiceprint information in the audio data; the determining unit 303 determines that the sub-audio data of which the voiceprint information meets the preset condition is the target sub-audio data; the scoring unit 304 scores the target sub-audio data according to a preset scoring rule to obtain a score corresponding to the audio data. Therefore, the audio data are segmented according to the difference between the voiceprint information in the audio data, the sub audio data of which the voiceprint information meets the preset condition are determined as the target sub audio data, and then the score of the target sub audio data is determined as the score of the audio data. Therefore, the interference of noise to the scoring system can be avoided, the accuracy of audio processing is improved, and the accuracy of scoring the audio data is improved.
The embodiment of the present application further provides an audio data processing system. Specifically, the audio data processing system may include a portable terminal, where the portable terminal may be the computer device in the foregoing embodiment, for example a touch-and-talk pen. The audio data processing system may also include a visual terminal, which may be any terminal with a display function, such as a mobile phone, a tablet computer, a notebook computer, a smart television, a wearable smart device, or a personal computer (PC). In addition, the audio data processing system further includes a server, which may be an independent server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery network (CDN) acceleration, big data, and artificial intelligence platforms.
The portable terminal may use an embedded device to integrate the voiceprint recognition model and the speech recognition model. In this way, the portable terminal can receive voice data, perform speech recognition directly, and then open or close an application program loaded on the terminal according to the recognized voice data. The portable terminal can also share the received voice data with the visual terminal and the server, so as to control the specific content displayed on the display interface of the visual terminal, and send the voice data to the server so that the server performs corresponding processing on it. A sketch of such on-device command dispatch follows.
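As a loose illustration of this on-device behavior only, the following sketch maps recognized command text to open/close actions on an application; the command phrases, app name, and launch/terminate callbacks are all invented for the example and are not from this application.

```python
# Illustrative on-device dispatch: recognized speech text is mapped to
# opening or closing an application. Command phrases, the app name, and
# the launch/terminate callbacks are assumptions made up for this sketch.
COMMANDS = {
    "open the reading app": ("reading", "open"),
    "close the reading app": ("reading", "close"),
}

def dispatch(recognized_text, launch, terminate):
    action = COMMANDS.get(recognized_text.strip().lower())
    if action is None:
        return False          # not a known command; ignore
    app, verb = action
    (launch if verb == "open" else terminate)(app)
    return True
```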
In the teaching mode, the portable terminal can collect audio data in the current environment and then send the collected audio data to the server. The server divides the audio data into a plurality of pieces of sub-audio data based on the differences between the voiceprint information in the audio data, then determines, among the plurality of pieces of sub-audio data, the target sub-audio data whose voiceprint information meets the preset condition, and finally scores the target sub-audio data; the score is taken as the score of the audio data and sent to the portable terminal and the visual terminal. The portable terminal broadcasts the score by voice, and the visual terminal displays the scoring result. For example, when the score based on the target sub-audio data is 95 points, the portable terminal may broadcast the voice prompt "Excellent! Your score this time is 95 points!". Alternatively, as shown in fig. 5, which is a schematic view of a display scene of the visual terminal, a score display area 110 may be displayed in the score display page 10 of the visual terminal, and the text "Excellent! Your score this time is 95 points!" may be displayed in the score display area 110. In some embodiments, a virtual character animation, such as an animation of a lovable character, may also be displayed in the score display area, and the virtual character may announce the score in the animation.
The visual terminal may further have a parent mode: the user can touch a parent-mode control on the display interface to enter a parent-mode login interface, and a parent can enter the parent mode to purchase corresponding courses. To prevent a child from accidentally entering the parent mode and mistakenly purchasing courses, login verification is generally set for the parent mode, and the parent mode can be entered only when the verification passes. In the related art, a password is generally set, or a corresponding operation must be performed, to unlock the parent mode. As shown in fig. 6A, which is a schematic diagram of the parent-mode login page, the parent-mode login verification page 20 has a login verification area 210, in which a prompt message "please input the following numbers in order" may be displayed, with the numbers shown as traditional Chinese characters. In general, children cannot read traditional Chinese characters well, so this prevents them from logging in to the parent mode by mistake. However, in some cases, for example when the child knows the parent's password or the unlocking operation, the child may still enter the parent mode. To solve this problem, voiceprint matching can be used in this application to identify whether the operator is a parent: for example, the user is required to read a passage aloud, the voiceprint information in that speech is then extracted and compared with preset voiceprint data, and the parent mode is entered only when they match. In this way, accidental purchases caused by a child entering the parent mode by mistake can be avoided. Specifically, as shown in fig. 6B, which is another schematic diagram of the parent-mode login page, the parent-mode login verification page 20 has a login verification area 210, in which a prompt message may be displayed: "Please read the following sentence aloud: Hello, good morning, glad to see you." In addition, a read-aloud control 211 may be displayed in the login verification area 210, and in response to a touch instruction on the read-aloud control 211, the terminal starts to collect voice information. The voiceprint information in the collected voice information is then matched against the preset voiceprint information. The preset voiceprint information can be set to the parent's voiceprint information, so that the parent mode can be logged in to, and related operations performed, only when the voiceprint information extracted from the collected voice information matches the preset voiceprint information. A hedged sketch of this gate appears below.
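The sketch again treats voiceprints as embedding vectors; extract_voiceprint and the 0.8 threshold are placeholder assumptions rather than details from this application.

```python
# Sketch of the parent-mode login gate: unlock only when the voiceprint
# of the read-aloud speech matches the enrolled parent voiceprint.
# extract_voiceprint and the threshold are illustrative assumptions.
import numpy as np

def try_enter_parent_mode(recorded_speech, parent_voiceprint,
                          extract_voiceprint, threshold=0.8):
    v = np.asarray(extract_voiceprint(recorded_speech), dtype=float)
    p = np.asarray(parent_voiceprint, dtype=float)
    sim = float(v @ p / (np.linalg.norm(v) * np.linalg.norm(p) + 1e-9))
    return sim >= threshold   # True -> unlock parent mode
```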
An embodiment of the present application further provides a computer device, which may be a terminal or a server; fig. 7 is a schematic structural diagram of the computer device provided in the present application. Specifically:
the computer device may include components such as a processor 401 of one or more processing cores, memory 402 of one or more storage media, a power supply 403, and an input unit 404. Those skilled in the art will appreciate that the computer device configuration illustrated in FIG. 7 does not constitute a limitation of computer devices, and may include more or fewer components than those illustrated, or some components may be combined, or a different arrangement of components. Wherein:
the processor 401 is a control center of the computer device, connects various parts of the entire computer device using various interfaces and lines, and performs various functions of the computer device and processes data by running or executing software programs and/or modules stored in the memory 402 and calling data stored in the memory 402, thereby monitoring the computer device as a whole. Optionally, processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and audio processing by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, where the program storage area may store the operating system, an application program required by at least one function (such as a sound playing function, an image playing function, or web page access), and the like, and the data storage area may store data created according to the use of the computer device, and the like. Further, the memory 402 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.
The computer device further comprises a power supply 403 for supplying power to the various components. Preferably, the power supply 403 is logically connected to the processor 401 through a power management system, so that charging, discharging, and power-consumption management are implemented through the power management system. The power supply 403 may further include one or more DC or AC power sources, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and other such components.
The computer device may also include an input unit 404, the input unit 404 being operable to receive input numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
Although not shown, the computer device may further include a display unit and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 401 in the computer device loads the executable file corresponding to the process of one or more application programs into the memory 402 according to the following instructions, and the processor 401 runs the application programs stored in the memory 402, thereby implementing various functions as follows:
acquiring audio data in the current environment, wherein the audio data comprises at least one piece of voiceprint information; dividing the audio data into a plurality of pieces of sub-audio data based on differences between the voiceprint information in the audio data; determining the sub-audio data whose voiceprint information meets a preset condition as target sub-audio data; and scoring the target sub-audio data according to a preset scoring rule to obtain a score corresponding to the audio data.
It should be noted that the computer device provided in the embodiment of the present application and the audio processing method in the foregoing embodiment belong to the same concept, and specific implementation of the above operations may refer to the foregoing embodiment, which is not described herein again.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, the embodiment of the present invention provides a computer-readable storage medium, in which a plurality of instructions are stored, and the instructions can be loaded by a processor to execute the steps in any one of the audio processing methods provided by the embodiment of the present invention. For example, the instructions may perform the steps of:
acquiring audio data in the current environment, wherein the audio data comprises at least one piece of voiceprint information; dividing the audio data into a plurality of pieces of sub-audio data based on differences between the voiceprint information in the audio data; determining the sub-audio data whose voiceprint information meets a preset condition as target sub-audio data; and scoring the target sub-audio data according to a preset scoring rule to obtain a score corresponding to the audio data.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
Wherein the computer-readable storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
Since the instructions stored in the computer-readable storage medium can execute the steps in any audio processing method provided in the embodiments of the present invention, the beneficial effects that can be achieved by any audio processing method provided in the embodiments of the present invention can be achieved, which are detailed in the foregoing embodiments and will not be described again here.
According to an aspect of the application, a computer program product or a computer program is provided, the computer program product or computer program comprising computer instructions stored in a storage medium. A processor of a computer device reads the computer instructions from the storage medium and executes them, causing the computer device to perform the audio processing method provided in the various optional implementations of fig. 2 or fig. 3.
The audio processing method, apparatus, storage medium, and computer device provided by the embodiments of the present invention are described in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the descriptions of the above embodiments are intended only to help in understanding the method and its core ideas. Meanwhile, those skilled in the art may, following the ideas of the present invention, make changes to the specific implementations and the application scope. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (10)

1. A method of audio processing, the method comprising:
acquiring audio data in the current environment, wherein the audio data comprises at least one piece of voiceprint information;
segmenting the audio data into a plurality of sub-audio data based on a difference between voiceprint information in the audio data;
determining sub audio data of which the voiceprint information meets a preset condition as target sub audio data;
and scoring the target sub-audio data according to a preset scoring rule to obtain a score corresponding to the audio data.
2. The method according to claim 1, wherein the determining that the sub-audio data with the voiceprint information satisfying the preset condition is the target sub-audio data comprises:
determining a first number of sub audio data of which the voiceprint information meets a preset condition;
and splicing the sub audio data of the first quantity according to a time sequence to obtain target sub audio data.
3. The method according to claim 1, wherein the determining that the sub-audio data with the voiceprint information satisfying the preset condition is the target sub-audio data comprises:
matching the voiceprint information of each sub-audio data with preset voiceprint information;
and determining the sub audio data of which the voiceprint information is matched with the preset voiceprint information as target sub audio data.
4. The method according to claim 3, wherein before matching the voiceprint information of each sub-audio data with the preset voiceprint information, the method further comprises:
acquiring a voiceprint information set of a current user, wherein the voiceprint information set comprises a plurality of voiceprint information and acquisition time of each voiceprint information;
and predicting the voiceprint information corresponding to the current time according to the voiceprint information set, and determining the voiceprint information corresponding to the current time as preset voiceprint information.
5. The method according to claim 4, wherein the predicting the voiceprint information corresponding to the current time according to the voiceprint information set comprises:
training a preset voiceprint information prediction model by adopting a voiceprint information training sample, wherein the voiceprint information training sample comprises voiceprint information packets of a plurality of users, and the voiceprint information packets comprise voiceprint information acquired by the users at different times;
and predicting the voiceprint information corresponding to the current time of the current user based on the trained preset voiceprint information prediction model and the voiceprint information set.
6. The method according to claim 1, wherein the determining that the sub-audio data with the voiceprint information satisfying the preset condition is the target sub-audio data comprises:
extracting tone features in the voiceprint information corresponding to each sub-audio data;
and determining the sub-audio data with the tone characteristics matched with the preset tone characteristics as target sub-audio data.
7. The method of claim 6, wherein the determining that the sub-audio data with the tone feature matching the preset tone feature is the target sub-audio data comprises:
acquiring age group data corresponding to each tone feature;
and determining the target sub-audio data corresponding to preset age group data.
8. An audio processing apparatus, characterized in that the apparatus comprises:
the system comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring audio data under the current environment, and the audio data comprises at least one piece of voiceprint information;
a dividing unit configured to divide the audio data into a plurality of sub audio data based on a difference between voiceprint information in the audio data;
the determining unit is used for determining the sub-audio data of which the voiceprint information meets the preset condition as target sub-audio data;
and the scoring unit is used for scoring the target sub-audio data according to a preset scoring rule to obtain a score corresponding to the audio data.
9. A computer-readable storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps of the audio processing method according to any one of claims 1 to 7.
10. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the audio processing method of any of claims 1 to 7 when executing the computer program.
CN202110504549.4A 2021-05-10 2021-05-10 Audio processing method and device, storage medium and computer equipment Pending CN113763962A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110504549.4A CN113763962A (en) 2021-05-10 2021-05-10 Audio processing method and device, storage medium and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110504549.4A CN113763962A (en) 2021-05-10 2021-05-10 Audio processing method and device, storage medium and computer equipment

Publications (1)

Publication Number Publication Date
CN113763962A true CN113763962A (en) 2021-12-07

Family

ID=78787106

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110504549.4A Pending CN113763962A (en) 2021-05-10 2021-05-10 Audio processing method and device, storage medium and computer equipment

Country Status (1)

Country Link
CN (1) CN113763962A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115174283A (en) * 2022-06-30 2022-10-11 上海掌门科技有限公司 Hosting authority configuration method and equipment
CN116631452A (en) * 2023-04-06 2023-08-22 深圳市亚通桥文化传播有限公司 Picture-book reading, recording and playback management system based on artificial intelligence
CN116631452B (en) * 2023-04-06 2024-01-02 深圳市亚通桥文化传播有限公司 Picture-book reading, recording and playback management system based on artificial intelligence

Similar Documents

Publication Publication Date Title
CN108509619B (en) Voice interaction method and device
CN107818798B (en) Customer service quality evaluation method, device, equipment and storage medium
CN108536802B (en) Interaction method and device based on child emotion
CN111833853B (en) Voice processing method and device, electronic equipment and computer readable storage medium
KR20220133312A (en) Automated assistants that accommodate multiple age groups and/or vocabulary levels
CN108563627B (en) Heuristic voice interaction method and device
CN109119071A (en) A kind of training method and device of speech recognition modeling
US10089898B2 (en) Information processing device, control method therefor, and computer program
CN109086590B (en) Interface display method of electronic equipment and electronic equipment
CN111767385A (en) Intelligent question and answer method and device
CN108766431B (en) Automatic awakening method based on voice recognition and electronic equipment
CN113763962A (en) Audio processing method and device, storage medium and computer equipment
CN110600033A (en) Learning condition evaluation method and device, storage medium and electronic equipment
CN106649253A (en) Auxiliary control method and system based on post verification
CN103730032A (en) Method and system for controlling multimedia data
CN110517668A (en) A kind of Chinese and English mixing voice identifying system and method
CN108710653B (en) On-demand method, device and system for reading book
CN112686051A (en) Semantic recognition model training method, recognition method, electronic device, and storage medium
CN113591489A (en) Voice interaction method and device and related equipment
WO2023137920A1 (en) Semantic truncation detection method and apparatus, and device and computer-readable storage medium
CN111091821B (en) Control method based on voice recognition and terminal equipment
CN114186041A (en) Answer output method
CN113409768A (en) Pronunciation detection method, pronunciation detection device and computer readable medium
CN114333832A (en) Data processing method and device and readable storage medium
KR20200071996A (en) Language study method using user terminal and central server

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination