CN109101484B - Recording file processing method and device, computer equipment and storage medium - Google Patents


Info

Publication number
CN109101484B
Authority
CN
China
Prior art keywords
text
cleaned
recording file
sound recording
recording
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810735639.2A
Other languages
Chinese (zh)
Other versions
CN109101484A (en)
Inventor
岳鹏昱
闫冬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201810735639.2A priority Critical patent/CN109101484B/en
Priority to PCT/CN2018/106259 priority patent/WO2020006879A1/en
Publication of CN109101484A publication Critical patent/CN109101484A/en
Application granted granted Critical
Publication of CN109101484B publication Critical patent/CN109101484B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a recording file processing method that addresses the low efficiency of cleaning recording files and the irregular organization that cleaning tends to produce. The method comprises: acquiring an uploaded recording file and the original text corresponding to it; calling a speech recognition interface to perform speech recognition on the recording file to obtain a recognized text; judging whether the recognized text is consistent with the original text; if they are consistent, storing the recording file in a preset model training set; if they are inconsistent, recording the file into a directory to be cleaned, where a handler listens to it and feeds back the correct recording text; acquiring the cleaned recording file and its corresponding recording text from the directory to be cleaned; and storing the cleaned recording file and the corresponding recording text in the model training set in an associated manner. The invention also provides a recording file processing apparatus, a computer device and a storage medium.

Description

Recording file processing method and device, computer equipment and storage medium
Technical Field
The invention relates to the technical field of voice recognition, in particular to a recording file processing method and device, computer equipment and a storage medium.
Background
Speech recognition technology is now widely applied: many platforms train speech recognition models in advance and offer speech recognition as an external service. To train such a model, a platform must collect a large number of voice files as learning samples, so many platforms release their own mobile APP clients and encourage users to upload recordings as training samples through the APP. This approach collects highly diverse training samples quickly and efficiently, but the resulting recording files are irregular. Although the recordings are produced by users reading aloud original texts provided by the platform, users' circumstances differ and the platform imposes no constraints, so the content read by some users is inconsistent with the original text. For this reason, before recordings obtained this way can be used as training samples, they must be cleaned.
At present, recording files are cleaned by handlers one by one, which is not only inefficient but also prone to problems such as non-standard handling after cleaning and irregular file directories.
Disclosure of Invention
In view of the above, it is desirable to provide a recording file processing method, apparatus, computer device and storage medium that improve the efficiency with which handlers clean recording texts and make recording files easier to manage and use.
A recording file processing method comprises the following steps:
acquiring an uploaded recording file and an original text corresponding to the recording file;
calling a speech recognition interface to perform speech recognition on the recording file to obtain a recognized text;
judging whether the recognized text is consistent with the original text;
if the recognized text is consistent with the original text, storing the recording file in a preset model training set;
if the recognized text is inconsistent with the original text, recording the recording file into a directory to be cleaned, where a handler listens to the recorded file and feeds back the correct recording text; the handler listening to the file and feeding back the correct recording text comprises: playing the recording file through a player built into the voice cleaning platform, and, after the handler has listened to the played audio, receiving and recording the recording text corresponding to the recording file; and acquiring the cleaned recording file and the corresponding recording text from the directory to be cleaned;
storing the cleaned recording file and the corresponding recording text in the model training set in an associated manner;
wherein recording the recording file into the directory to be cleaned comprises:
acquiring the initially determined application field of the recording file;
recording the recording file at the position in the directory to be cleaned to which the initially determined application field belongs, the directory to be cleaned being divided into a plurality of positions that respectively record information on recording files belonging to different application fields;
wherein acquiring the cleaned recording file and the corresponding recording text from the directory to be cleaned comprises: acquiring the recording file cleaned in the directory to be cleaned, the recording text corresponding to the cleaned recording file, and a first application field, the first application field being determined by the handler when cleaning the recording file;
and wherein storing the cleaned recording file and the corresponding recording text in the model training set in an associated manner specifically comprises:
judging whether the first application field of the recording file is consistent with the initially determined application field;
if the first application field is consistent with the initially determined application field, storing the cleaned recording file and the corresponding recording text, in an associated manner, at the position in the model training set to which the initially determined application field belongs;
and if the first application field is inconsistent with the initially determined application field, storing the cleaned recording file and the corresponding recording text, in an associated manner, at the position in the model training set to which the first application field belongs.
A recording file processing apparatus comprises:
a recording file acquisition module, configured to acquire an uploaded recording file and an original text corresponding to the recording file;
a speech recognition module, configured to call a speech recognition interface to perform speech recognition on the recording file to obtain a recognized text;
a text judgment module, configured to judge whether the recognized text is consistent with the original text;
a first storage module, configured to store the recording file in a preset model training set if the judgment result of the text judgment module is positive;
a file recording module, configured to record the recording file into a directory to be cleaned if the judgment result of the text judgment module is negative, where a handler listens to the recorded file and feeds back the correct recording text; the handler listening and feeding back the correct recording text comprises: playing the recording file through a player built into the voice cleaning platform, and, after the handler has listened to the played audio, receiving and recording the recording text corresponding to the recording file;
a cleaned file acquisition module, configured to acquire the cleaned recording file and the corresponding recording text from the directory to be cleaned;
a second storage module, configured to store the cleaned recording file and the corresponding recording text in the model training set in an associated manner;
wherein the file recording module comprises:
an initial field acquisition unit, configured to acquire the initially determined application field of the recording file;
a first recording unit, configured to record the recording file at the position in the directory to be cleaned to which the initially determined application field belongs, the directory to be cleaned being divided into a plurality of positions that respectively record information on recording files belonging to different application fields;
the cleaned file acquisition module is specifically configured to acquire the recording file cleaned in the directory to be cleaned, the recording text corresponding to the cleaned recording file, and a first application field, the first application field being determined by the handler when cleaning the recording file;
the second storage module specifically comprises:
a field judgment unit, configured to judge whether the first application field of the recording file is consistent with the initially determined application field;
a first storage unit, configured to store the cleaned recording file and the corresponding recording text, in an associated manner, at the position in the model training set to which the initially determined application field belongs, if the judgment result of the field judgment unit is positive;
and a second storage unit, configured to store the cleaned recording file and the corresponding recording text, in an associated manner, at the position in the model training set to which the first application field belongs, if the judgment result of the field judgment unit is negative.
A computer device comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above recording file processing method when executing the computer program.
A computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the above recording file processing method.
According to the recording file processing method, apparatus, computer device and storage medium, an uploaded recording file and the original text corresponding to it are first acquired; a speech recognition interface is then called to perform speech recognition on the recording file, yielding a recognized text; whether the recognized text is consistent with the original text is judged; if it is consistent, the recording file is stored in a preset model training set; if it is inconsistent, the recording file is recorded into a directory to be cleaned, where a handler listens to it and feeds back the correct recording text; the cleaned recording file and the corresponding recording text are then acquired from the directory to be cleaned; finally, the cleaned recording file and the corresponding recording text are stored in the model training set in an associated manner.
By performing speech recognition on uploaded recording files, the method determines, before a handler cleans anything, which recognized texts are consistent with their original texts. Files whose recognized text matches the original need no cleaning and are stored directly in the model training set; only files whose recognized text differs are recorded into the directory to be cleaned and cleaned by a handler. This spares handlers part of the cleaning work and improves cleaning efficiency. Moreover, both the texts that needed no cleaning and the cleaned texts end up in the model training set, which makes the recording files easy to manage and use, and convenient as samples for subsequently training a speech recognition model.
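The routing described above can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: `recognize` stands in for the speech recognition interface (any callable mapping an audio path to text), and the directories stand in for the model training set and the directory to be cleaned.

```python
import os
import shutil

def process_recording(audio_path, original_text, recognize,
                      training_set_dir="training_set",
                      to_clean_dir="to_clean"):
    """Route one uploaded recording: store it directly in the model
    training set when the recognized text matches the original text,
    otherwise queue it in the directory to be cleaned for a handler."""
    recognized = recognize(audio_path)        # S102: speech recognition
    if recognized == original_text:           # S103: consistency check
        dest = training_set_dir               # S104: no cleaning needed
    else:
        dest = to_clean_dir                   # S105: handler must clean
    os.makedirs(dest, exist_ok=True)
    shutil.copy(audio_path, dest)
    return dest
```

A real deployment would record metadata (uploader, application field) alongside the file rather than only copying it, but the branch structure is the same.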
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those skilled in the art can derive other drawings from them without inventive labor.
FIG. 1 is a schematic diagram of an application environment of a recording file processing method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a recording file processing method according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating step S103 of the recording file processing method in an application scenario according to an embodiment of the present invention;
fig. 4 is a schematic flow chart illustrating the focused attention content in the original text when the sound recording file processing method is used for label cleaning in an application scenario according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating the step S105 of the audio file processing method in an application scenario according to an embodiment of the present invention;
FIG. 6 is a schematic flow chart illustrating a recording file processing method for screening a high-quality recording account in an application scenario according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of an apparatus for processing audio files according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a computing device in accordance with an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments derived by those skilled in the art from these embodiments without creative effort fall within the protection scope of the present invention.
The recording file processing method provided by the application can be applied in the environment shown in fig. 1, in which a client communicates with a server over a network. The client may be, but is not limited to, a personal computer, laptop, smartphone, tablet or portable wearable device. The server may be implemented as a stand-alone server or as a cluster of servers.
In an embodiment, as shown in fig. 2, a recording file processing method is provided. Taking its application to the server in fig. 1 as an example, the method includes the following steps:
s101, acquiring an uploaded sound recording file and an original text corresponding to the sound recording file;
in this embodiment, the user can use the APP client to record each recording file, the APP client is docked with the server of the voice cleaning platform through the network, and each recording file is automatically uploaded to the voice cleaning platform after recording is completed, so that the voice cleaning platform can acquire the recording files.
In addition, when recording the sound recording files, the user compares the original text provided by the APP client, and may consider that the original text is the standard text of the sound recording file, and when acquiring the sound recording file, the voice cleaning platform should acquire the original text corresponding to the sound recording file.
S102, calling a voice recognition interface to perform voice recognition on the recording file to obtain a recognition text;
After the uploaded recording file is obtained, the speech recognition interface of the voice cleaning platform can be called to perform speech recognition on it and obtain a recognized text. The voice cleaning platform in this embodiment may have its own speech recognition capability, or it may call the speech recognition interfaces of other platforms; either way, the recorded speech is converted to characters, yielding the text recognized from the recording file. Because the recognized text is produced by speech recognition, it may or may not match the original text. When it does not, there are at least three possible causes: 1. the user did not read according to the original text when recording; 2. the user's pronunciation was inaccurate; 3. the speech recognition itself made errors. This also shows why a recording file must be cleaned before it is used as a sample for training a speech recognition model.
S103, judging whether the recognized text is consistent with the original text; if so, executing step S104, and if not, executing step S105;
In this embodiment, to reduce the handlers' cleaning workload and improve cleaning efficiency, the method first determines, before any cleaning, which recording files already match their original texts. If the recognized text obtained by speech recognition is consistent with the original text, the recording is accurate: the file can be used directly as a training sample and no handler needs to clean it. Otherwise, if the two are inconsistent, the recording is not accurate, cannot be used directly as a training sample, and needs a handler to clean it, so step S105 is executed to complete the cleaning operation.
Further, when judging whether a recording file needs cleaning, the method can calculate the word error rate of the recognized text corresponding to it, for use in subsequent analysis. Specifically, as shown in fig. 3, step S103 includes:
s201, calculating the word error rate of the recognition text relative to the original text;
s202, judging whether the word error rate is 0 or not, if so, executing a step S203, and if not, executing a step S204;
s203, determining that the recognition text is consistent with the original text;
and S204, determining that the recognition text is inconsistent with the original text.
For step S201, the recognized text may be compared with the original text to calculate its word error rate relative to the original text. The higher the word error rate, the larger the difference between the recognized text and the original text; the lower it is, the smaller the difference. When the word error rate is 0, the recognized text is identical to the original text.
As for steps S202 to S204: if the word error rate is 0, the recognized text is consistent with the original text, and step S104 is executed; otherwise, if the word error rate is not 0, the recognized text is inconsistent with the original text, and step S105 is executed.
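Steps S201 to S204 can be sketched as follows. The patent does not specify the error-rate formula, so this sketch makes a common assumption: Levenshtein edit distance over characters (a natural token for Chinese text), divided by the length of the original text.

```python
def edit_distance(ref, hyp):
    # Levenshtein distance by dynamic programming, one row at a time.
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                           # deletion
                        dp[j - 1] + 1,                       # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))   # substitution
            prev = cur
    return dp[len(hyp)]

def word_error_rate(original, recognized):
    """S201: error rate of the recognized text relative to the original
    text; each character is treated as one token. A rate of exactly 0
    means the two texts are identical (S203)."""
    if not original:
        return 0.0 if not recognized else 1.0
    return edit_distance(original, recognized) / len(original)

def is_consistent(original, recognized):
    # S202: consistent only when the error rate is exactly 0.
    return word_error_rate(original, recognized) == 0
```

Any nonzero rate routes the file to the directory to be cleaned (S204/S105); the numeric value remains available for the "subsequent analysis" the description mentions.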
S104, storing the sound recording file into a preset model training set;
With respect to step S104: once the recognized text is determined to be consistent with the original text, the recording file can be considered accurate and used directly for training a speech recognition model, so it is stored in a preset model training set. Specifically, the file may be stored in a designated database, or recorded in a designated training file directory; when sample training is required, the recordings used as training samples can be found by searching that directory.
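The "training file directory" option can be sketched as below. The JSON index file is an assumption made for illustration; the patent only requires that the file/text pairs be searchable when sample training begins.

```python
import json
import os
import shutil

def add_to_training_set(audio_path, recording_text,
                        training_dir="training_set"):
    """S104 sketch: copy a recording into the model training set and
    record the file/text pair in a JSON index that sample training can
    later search."""
    os.makedirs(training_dir, exist_ok=True)
    shutil.copy(audio_path, training_dir)
    index_path = os.path.join(training_dir, "index.json")
    index = {}
    if os.path.exists(index_path):
        with open(index_path, encoding="utf-8") as f:
            index = json.load(f)
    index[os.path.basename(audio_path)] = recording_text
    with open(index_path, "w", encoding="utf-8") as f:
        json.dump(index, f, ensure_ascii=False, indent=2)
    return index
```

Storing in a designated database instead would replace the JSON index with an insert-or-update on a table keyed by file name.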
S105, recording the recording file into a directory to be cleaned, where a handler listens to the recorded file and feeds back the correct recording text;
With respect to step S105: once the recognized text is determined to be inconsistent with the original text, the recording file requires cleaning, so it is recorded into the directory to be cleaned. In this embodiment, handlers listen to the files recorded in that directory and feed back correct recording texts. That is, when cleaning, a handler first queries the directory to learn which files need cleaning, then fetches them. The handler can play a recording directly through the player built into the voice cleaning platform, which offers operations such as speed-up, slow-down, fast-forward, rewind and pause to make listening easier. After listening to the audio, the handler records the text heard onto the platform; that recorded text is regarded as the correct text corresponding to the recording file, i.e. the recording text.
Further, as shown in fig. 4, before executing the following step S106, the sound recording file processing method in this embodiment may further include:
s301, acquiring a sound recording file to be cleaned and a corresponding original text;
s302, the sound recording file to be cleaned and the corresponding original text are sent to voice recognition service interfaces of different platforms for voice recognition, and platform recognition texts fed back by the platforms are obtained;
s303, comparing the original text with each platform identification text respectively, and determining partial text contents in the original text, which are consistent with each platform identification text;
s304, marking text contents except the partial text contents in the original text;
s305, sending the marked original text to a designated terminal for cleaning by a processor.
It should be noted that although cleaning relies mainly on manual work by handlers, many recording files contain a great deal of text, the APP client collects an extremely large number of them, and most can serve as training samples once cleaned. The workload on handlers is therefore excessive, and staff are often insufficient. To alleviate this and improve cleaning efficiency, steps S301 to S305 mark, on the original text of each recording file to be cleaned and before any manual work, the content the handler needs to focus on. This helps the handler listen to the recording and check the original text in a targeted way, effectively improving cleaning efficiency.
In step S301, the directory to be cleaned may first be queried; the recording file to be cleaned is then obtained, together with its corresponding original text.
As for step S302: because different platforms often adopt different speech recognition models, recognizing the same recording file on different platforms yields different results. In this embodiment, several platforms each recognize the same recording file to be cleaned, producing one platform-recognized text per platform; comparing these texts with the original text then determines which parts of the text corresponding to the recording need no attention.
For step S303, after the platform-recognized texts are obtained, the original text may be compared with each of them to determine the partial content of the original text that is consistent with every platform-recognized text. If a part of the original text appears in every platform-recognized text, the audio corresponding to that part can be judged accurate, so the handler need not attend to it. Conversely, the audio corresponding to the remaining content of the original text is likely to contain parts needing cleaning and correction, to which the handler's attention should be drawn.
As for steps S304 and S305: as described above, the audio corresponding to the confirmed text content needs no cleaning and no attention from the handler. The remaining text content is therefore marked for emphasis, with highlighting such as bold, underline or background color. When cleaning a recording file, the handler can then listen while checking the original text and focus on the marked content. Concentrating on these emphasized parts improves cleaning efficiency, makes it easier to locate and correct "wrong words", and helps the handler finish cleaning a recording file in a single pass.
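Steps S303 and S304 can be sketched with standard sequence matching. This is an illustration under assumptions: the patent does not say how "consistent partial content" is computed, so the sketch treats a character of the original as confirmed when every platform's result contains it in a matching block, and marks the rest with `[[...]]` in place of the bold/underline/background highlighting a real platform would render.

```python
import difflib

def confirmed_positions(original, platform_text):
    """Positions in the original text covered by blocks that also appear
    in one platform's recognition result (step S303)."""
    ok = set()
    matcher = difflib.SequenceMatcher(None, original, platform_text)
    for block in matcher.get_matching_blocks():
        ok.update(range(block.a, block.a + block.size))
    return ok

def mark_for_review(original, platform_texts):
    """Wrap the parts of the original text that at least one platform
    failed to recognize in [[...]] so the handler can focus on them
    (step S304)."""
    ok = set(range(len(original)))
    for text in platform_texts:
        ok &= confirmed_positions(original, text)
    out, marking = [], False
    for i, ch in enumerate(original):
        if i not in ok and not marking:
            out.append("[[")
            marking = True
        elif i in ok and marking:
            out.append("]]")
            marking = False
        out.append(ch)
    if marking:
        out.append("]]")
    return "".join(out)
```

The marked string is what step S305 would send to the handler's terminal; an HTML front end would emit `<b>` or background-color spans instead of the bracket markers.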
S106, acquiring the cleaned recording file and the corresponding recording text in the directory to be cleaned;
the recording text is determined by a processor after being cleaned, and the recording text can be considered to be consistent with the audio content in the recording file. After cleaning the recording file, the processing personnel can modify the original text or the recognized text into the recording text on the voice cleaning platform, and the voice cleaning platform can obtain the recording text corresponding to the cleaned recording file after submitting the recording text.
And S107, storing the cleaned sound recording file and the corresponding sound recording text in the model training set in an associated manner.
It will be appreciated that a cleaned recording file can already be used for training a speech recognition model, so the cleaned file and its corresponding recording text are stored, in an associated manner, in the model training set.
In addition, the directory to be cleaned may classify recording files as it records them. Since different speech recognition models are often distinguished by application field, such as finance, news, sports or movie dialogue, the directory to be cleaned may record the files awaiting cleaning under different application-field classifications. Further, as shown in fig. 5, step S105 may include:
S401, acquiring the initially determined application field of the sound recording file;
S402, recording the sound recording file to the position to which the initially determined application field belongs in the directory to be cleaned;
On the basis of the above steps S401 and S402, the step S106 is specifically: acquiring the cleaned sound recording file in the directory to be cleaned, the sound recording text corresponding to the cleaned sound recording file, and a first application field, where the first application field is determined by the processing personnel when cleaning the sound recording file; further, step S107 may specifically include the following steps S501 to S503:
S501, judging whether the first application field of the sound recording file is consistent with the initially determined application field;
S502, if the first application field of the sound recording file is consistent with the initially determined application field, storing the cleaned sound recording file and the corresponding sound recording text in association at the position to which the initially determined application field belongs in the model training set;
S503, if the first application field of the sound recording file is inconsistent with the initially determined application field, storing the cleaned sound recording file and the corresponding sound recording text in association at the position to which the first application field belongs in the model training set.
As for steps S401 and S402, it can be understood that the application field of a sound recording file may be determined from the original text corresponding to it, and since the original text is provided by the voice cleaning platform, the application field can be predetermined and acquired in advance. It should be noted in particular that if a user did not record the sound recording file according to the original text, the file necessarily needs cleaning, and the processing personnel can listen to it during cleaning and then re-determine its application field. As for the directory to be cleaned, it can be divided into a plurality of different positions that respectively record the information of the recording files belonging to different application fields. This makes the recording files more convenient to manage and use: when the processing personnel need to clean the recording files of a certain application field in a concentrated manner, they can quickly look up the files belonging to that field at the corresponding position of the directory to be cleaned.
For this reason, on the basis of the above steps S401 and S402, the step S106 may specifically be: acquiring the cleaned sound recording file in the directory to be cleaned, the sound recording text corresponding to the cleaned sound recording file, and a first application field determined by the processing personnel when cleaning the file. That is, when cleaning a recording file, the processing personnel not only listen to it and feed back the transcribed recording text, but also re-determine its application field, namely the first application field. Accordingly, when step S107 stores the cleaned sound recording file and the corresponding sound recording text in association in the model training set, and in order to classify the cleaned files as they are collected, the method may further include the steps S501-S503. It is first judged whether the first application field of the sound recording file is consistent with the initially determined application field. If they are consistent, the application field initially determined before cleaning was accurate, so the sound recording file and the corresponding sound recording text may be stored in association at the position to which the initially determined application field belongs in the model training set. If they are inconsistent, the initially determined application field was inaccurate; the new first application field determined by the processing personnel during cleaning prevails, so the sound recording file and the corresponding sound recording text are stored in association at the position to which the first application field belongs in the model training set.
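The routing decision of steps S501 to S503 can be sketched as follows. Modelling the model training set as a plain in-memory mapping from application field to samples is an illustrative assumption; the patent does not prescribe a storage layout.

```python
def store_cleaned_file(training_set: dict, audio_path: str, recording_text: str,
                       initial_field: str, first_field: str) -> str:
    """Store a cleaned recording file and its recording text under the
    application field taken to be authoritative: the initially determined
    field if the cleaner confirmed it (S502), otherwise the first
    application field re-determined during cleaning (S503)."""
    if first_field == initial_field:
        field = initial_field   # S502: initial classification was accurate
    else:
        field = first_field     # S503: the cleaner's re-determination prevails
    training_set.setdefault(field, []).append((audio_path, recording_text))
    return field
```

The two branches coincide when the fields agree, so the explicit comparison mainly documents that the first application field prevails on disagreement.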
By dividing the cleaned sound recording files by application field in this way, when the cleaned files are used as samples to train a speech recognition model, the recording files of the target application field can be retrieved quickly and specifically. For example, to train a speech recognition model for the "car" field, each audio file stored at the position to which the "car" field belongs in the model training set can be obtained as a training sample.
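Retrieval by target field is then a direct lookup. Assuming, purely for illustration, a field-keyed mapping (the field names and file names below are hypothetical):

```python
# Hypothetical layout of the model training set:
# application field -> list of (audio_path, recording_text) samples.
model_training_set = {
    "car":     [("car_001.wav", "text one"), ("car_002.wav", "text two")],
    "finance": [("fin_001.wav", "text three")],
}

def training_samples_for(field: str):
    """Return every (audio file, recording text) sample stored at the
    position of the given application field, e.g. to train a 'car' model."""
    return model_training_set.get(field, [])
```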
Further, in this embodiment, since the server collects audio files recorded by users through the client, there may be significant quality differences between the files recorded by different users: some users record files of better quality, for example with a very low or even zero word error rate, while others record files of poorer quality. From a cost perspective, users with good recording quality are naturally preferred by the server. Therefore, in order to screen out users with consistently good recording quality from the user base, as shown in fig. 6, the method may further include:
S501, counting the word error rate of each sound recording file historically uploaded by a target account;
S502, calculating the average word error rate of the sound recording files recorded by the target account according to the counted word error rates of the historically uploaded files;
S503, judging whether the average word error rate of the target account is smaller than a preset threshold; if so, executing step S504, and if not, executing step S505;
S504, determining the target account as a high-quality recording account, where a high-quality recording account is rewarded by a preset incentive mechanism when uploading a recording file;
S505, processing according to a preset flow.
In step S501, as can be seen from step S201, the server determines whether the audio file needs to be cleaned and calculates the word error rate of the recognized text corresponding to the audio file, so that the server can easily count the audio files uploaded by the user accounts in history and the word error rates of the audio files when needed.
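The word error rate itself is not defined in detail here; a common choice, shown below as a hedged sketch, is edit distance between the two texts divided by the length of the original. Character-level units are used, which suits Chinese text (word-level units would require segmentation first); a rate of 0 then means the recognized text is identical to the original.

```python
def word_error_rate(recognized: str, original: str) -> float:
    """Word error rate of the recognized text relative to the original:
    edit distance (substitutions + insertions + deletions) over the
    length of the original, computed with rolling-row dynamic programming."""
    m, n = len(recognized), len(original)
    # prev[j] holds the edit distance between recognized[:i-1] and original[:j].
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if recognized[i - 1] == original[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution / match
        prev = curr
    return prev[n] / n if n else 0.0
```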
For step S502, it can be understood that the average word error rate of the target account is the mean of the word error rates of the sound recording files it has historically uploaded. For example, assuming the target account A has historically uploaded 3 sound recording files with word error rates of 0.2, 0.1 and 0.3 respectively, its average word error rate is (0.2 + 0.1 + 0.3) / 3 = 0.2.
For steps S503 to S505, in this embodiment, if the average word error rate of a user account is lower than a preset threshold, the recording files recorded by that user may be considered good, and the user is one the server welcomes. The preset threshold may be set according to actual usage; for example, it may be set to 0.1, i.e. 10%. When the average word error rate of the target account is less than 10%, the target account may be regarded as belonging to a high-quality user and determined as a high-quality recording account. To encourage the users of high-quality recording accounts to actively upload recording files, the server may reward such an account, when or after it uploads a recording file, according to a preset incentive mechanism, for example by raising the account's authority, granting account-system credit, or sending a small gift. Conversely, when the average word error rate of the target account is greater than or equal to 10%, the target account is not regarded as a high-quality recording account, and processing proceeds according to the preset flow; the preset flow may specifically be that the target account is simply not determined as a high-quality recording account.
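Steps S501 to S505 reduce to a short computation. In the sketch below, the 0.1 threshold is the 10% example value from the text, and treating an account with no upload history as non-premium is an assumption.

```python
def is_premium_account(historical_wers: list[float], threshold: float = 0.1) -> bool:
    """Steps S501-S505: average the word error rates of an account's
    historically uploaded recording files and compare with the preset
    threshold (0.1, i.e. 10%, per the example in the text)."""
    if not historical_wers:
        return False  # no upload history yet: not judged premium (assumption)
    average = sum(historical_wers) / len(historical_wers)
    return average < threshold
```

The account from the example above, with word error rates 0.2, 0.1 and 0.3, averages 0.2 and is therefore not determined as a high-quality recording account.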
In addition, the server in this embodiment may also track the cleaning work of the processing personnel, such as the number of cleaned audio files, the number of audio files still to be cleaned, and the like.
In this embodiment, voice recognition is performed on the uploaded recording file, and whether the recognized text is consistent with the original text is judged before the processing personnel clean the file. A recording file whose recognized text is consistent with the original text does not need cleaning and is stored directly in the model training set; only the recording files whose recognized text is inconsistent with the original text are recorded into the directory to be cleaned and cleaned by the processing personnel. This saves part of the cleaning work and improves the efficiency of cleaning recording files. Moreover, both the recording files that need no cleaning and the cleaned recording files, together with their recording texts, are ultimately stored in the model training set, which makes the recording files convenient to manage and use, and convenient for subsequently training a speech recognition model with them as samples.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by functions and internal logic of the process, and should not limit the implementation process of the embodiments of the present invention in any way.
In an embodiment, a sound recording file processing apparatus is provided, and the sound recording file processing apparatus corresponds to the sound recording file processing methods in the above embodiments one to one. As shown in fig. 7, the audio file processing apparatus includes an audio file obtaining module 601, a speech recognition module 602, a text determination module 603, a first storage module 604, a file recording module 605, a cleaned file obtaining module 606, and a second storage module 607. The functional modules are explained in detail as follows:
a recording file obtaining module 601, configured to obtain an uploaded recording file and an original text corresponding to the recording file;
the voice recognition module 602 is configured to call a voice recognition interface to perform voice recognition on the recording file to obtain a recognition text;
a text determining module 603, configured to determine whether the recognized text is consistent with the original text;
a first storage module 604, configured to store the sound recording file in a preset model training set if the determination result of the text determination module is yes;
the file recording module 605 is configured to record the recording file to a directory to be cleaned if the determination result of the text determination module is negative, and a processing person listens to the recording file recorded by the directory to be cleaned and feeds back a correct recording text;
a cleaned file obtaining module 606, configured to obtain the cleaned sound recording file and the corresponding sound recording text in the directory to be cleaned;
a second storage module 607, configured to store the cleaned sound recording file and the corresponding sound recording text in the model training set in an associated manner.
Further, the text determination module may include:
the wrong word rate calculating unit is used for calculating the wrong word rate of the recognition text relative to the original text;
the wrong word rate judging unit is used for judging whether the wrong word rate is 0 or not;
the first determining unit is used for determining that the recognition text is consistent with the original text if the judgment result of the misword rate judging unit is positive;
and the second determining unit is used for determining that the recognition text is inconsistent with the original text if the judgment result of the misword rate judging unit is negative.
Further, the file recording module may include:
the initial domain acquiring unit is used for acquiring the initial application domain of the sound recording file;
the first recording unit is used for recording the sound recording file to the position of the originally determined application field in the directory to be cleaned;
the cleaned file acquisition module is specifically used for: acquiring a sound recording file cleaned in the directory to be cleaned, a sound recording text corresponding to the cleaned sound recording file and a first application field, wherein the first application field is determined by a processing person when the sound recording file is cleaned;
the second storage module may specifically include:
the domain judging unit is used for judging whether the first application domain of the sound recording file is consistent with the initially determined application domain;
a first storage unit, configured to, if the determination result of the domain determination unit is yes, store the cleaned audio file and the corresponding audio text in association with the position to which the application domain originally determined in the model training set belongs;
and the second storage unit is used for storing the cleaned sound recording file and the corresponding sound recording text in a position to which the first application field belongs in the model training set in an associated manner if the judgment result of the field judgment unit is negative.
Further, the sound recording file processing apparatus may further include:
the file text acquisition module is used for acquiring the audio file to be cleaned and the corresponding original text;
the platform recognition module is used for sending the sound recording file to be cleaned and the corresponding original text to the voice recognition service interfaces of different platforms for voice recognition to obtain recognition texts of the platforms fed back by the platforms;
the platform identification text comparison module is used for comparing the original text with each platform identification text respectively and determining partial text contents in the original text, which are consistent with each platform identification text;
the text labeling module is used for labeling the text contents except the partial text contents in the original text;
and the cleaning and sending module is used for sending the marked original text to a designated terminal so as to be cleaned by a processor.
Further, the sound recording file processing apparatus may further include:
the statistical module is used for counting the word error rate of each sound recording file uploaded by the target account history;
the average wrong word rate calculation module is used for calculating the average wrong word rate of the recording sound files recorded by the target account according to the counted wrong word rate of each recording sound file uploaded historically;
the average wrong word rate judging module is used for judging whether the average wrong word rate of the target account is smaller than a preset threshold value or not;
and the high-quality account determining module is used for determining the target account as a high-quality recording account if the judgment result of the average word-miss rate judging module is positive, and the high-quality recording account is rewarded by a preset incentive mechanism when uploading the recording file.
For the specific limitations of the sound recording file processing apparatus, reference may be made to the above limitations on the sound recording file processing method, which are not described herein again. All or part of the modules in the sound recording file processing device can be realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 8. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operating system and the computer program to run on the non-volatile storage medium. The database of the computer device is used for storing the data involved in the sound recording file processing method. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a sound recording file processing method.
In one embodiment, a computer device is provided, which includes a memory, a processor and a computer program stored in the memory and executable on the processor, and the processor executes the computer program to implement the steps of the sound recording file processing method in the foregoing embodiments, such as steps S101 to S107 shown in fig. 2. Alternatively, the processor, when executing the computer program, implements the functions of the modules/units of the sound recording file processing apparatus in the above-described embodiment, for example, the functions of the modules 601 to 607 shown in fig. 7. To avoid repetition, further description is omitted here.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the sound recording file processing method in the above-described embodiments, such as the steps S101 to S107 shown in fig. 2. Alternatively, the computer program, when executed by the processor, implements the functions of the modules/units of the sound recording file processing apparatus in the above-described embodiments, such as the functions of the modules 601 to 607 shown in fig. 7. To avoid repetition, further description is omitted here.
It will be understood by those skilled in the art that all or part of the processes of the methods in the embodiments described above may be implemented by a computer program instructing related hardware; the computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (8)

1. A method for processing a sound recording file is characterized by comprising the following steps:
acquiring an uploaded sound recording file and an original text corresponding to the sound recording file;
calling a voice recognition interface to perform voice recognition on the recording file to obtain a recognition text;
judging whether the recognition text is consistent with the original text;
if the recognition text is consistent with the original text, storing the sound recording file into a preset model training set;
if the recognition text is inconsistent with the original text, recording the recording file to a directory to be cleaned, the recording file recorded in the directory to be cleaned being listened to by a processing person who feeds back a correct recording text; wherein the listening to the recording file recorded in the directory to be cleaned and feeding back a correct recording text comprises: playing the recording file through a player built into the voice cleaning platform, and receiving and recording the recording text corresponding to the recording file after the processing person listens to the audio of the played recording file;
acquiring the cleaned sound recording file and the corresponding sound recording text in the directory to be cleaned;
storing the cleaned sound recording file and the corresponding sound recording text in the model training set in an associated manner;
the recording the sound recording file to a directory to be cleaned comprises:
acquiring the initially determined application field of the sound recording file;
recording the sound recording file to the position of the initially determined application field in the directory to be cleaned; the directory to be cleaned being divided into a plurality of different positions that respectively record information of the sound recording files belonging to different application fields;
the acquiring of the sound recording file and the corresponding sound recording text after being cleaned in the directory to be cleaned includes: acquiring a sound recording file cleaned in the directory to be cleaned, a sound recording text corresponding to the cleaned sound recording file and a first application field, wherein the first application field is determined by a processing person when the sound recording file is cleaned;
the storing the cleaned sound recording file and the corresponding sound recording text in the model training set in an associated manner specifically comprises:
judging whether the first application field of the sound recording file is consistent with the initially determined application field;
if the first application field of the sound recording file is consistent with the initially determined application field, the cleaned sound recording file and the corresponding sound recording text are stored in the model training set in a correlated mode to the position of the initially determined application field;
and if the first application field of the sound recording file is inconsistent with the initially determined application field, the cleaned sound recording file and the corresponding sound recording text are stored in the position of the first application field in the model training set in an associated manner.
2. The method of claim 1, wherein the determining whether the recognized text is consistent with the original text comprises:
calculating the word error rate of the recognized text relative to the original text;
judging whether the word error rate is 0 or not;
if the word error rate is 0, determining that the recognition text is consistent with the original text;
and if the wrong word rate is not 0, determining that the recognition text is inconsistent with the original text.
3. The sound recording file processing method according to claim 1, wherein before the acquiring of the cleaned sound recording file and the corresponding sound recording text in the directory to be cleaned, the method further comprises:
acquiring a record file to be cleaned and a corresponding original text;
sending the sound recording file to be cleaned and the corresponding original text to voice recognition service interfaces of different platforms for voice recognition to obtain platform recognition texts fed back by the platforms;
respectively comparing the original text with each platform identification text, and determining partial text contents in the original text, which are consistent with each platform identification text;
labeling text contents except the partial text contents in the original text;
and sending the marked original text to a designated terminal for cleaning by a processor.
4. The audio record file processing method according to any one of claims 1 to 3, further comprising:
counting word error rates of all the sound recording files uploaded by the target account history;
calculating the average word error rate of the target account recording sound recording files according to the word error rate of each sound recording file uploaded in the history obtained through statistics;
judging whether the average word error rate of the target account is smaller than a preset threshold value or not;
and if the average word error rate of the target account is smaller than a preset threshold value, determining the target account as a high-quality recording account, and rewarding by a preset incentive mechanism when the high-quality recording account uploads the recording file.
5. An audio file processing apparatus, comprising:
the recording file acquisition module is used for acquiring the uploaded recording file and an original text corresponding to the recording file;
the voice recognition module is used for calling a voice recognition interface to perform voice recognition on the recording file to obtain a recognition text;
the text judgment module is used for judging whether the identification text is consistent with the original text;
the first storage module is used for storing the sound recording file into a preset model training set if the judgment result of the text judgment module is positive;
the file recording module is used for recording the recording file to a directory to be cleaned if the judgment result of the text judgment module is negative, the recording file recorded in the directory to be cleaned being listened to by a processing person who feeds back a correct recording text; wherein the listening to the recording file recorded in the directory to be cleaned and feeding back a correct recording text comprises: playing the recording file through a player built into the voice cleaning platform, and receiving and recording the recording text corresponding to the recording file after the processing person listens to the audio of the played recording file;
the cleaned file acquisition module is used for acquiring the cleaned sound recording files and the corresponding sound recording texts in the directory to be cleaned;
the second storage module is used for storing the cleaned sound recording file and the corresponding sound recording text into the model training set in an associated manner;
the file recording module comprises:
the initial domain acquiring unit is used for acquiring the initial application domain of the sound recording file;
the first recording unit is used for recording the sound recording file to the position of the initially determined application field in the directory to be cleaned; the directory to be cleaned being divided into a plurality of different positions that respectively record information of the sound recording files belonging to different application fields;
the cleaned file acquisition module is specifically configured to: acquire the cleaned sound recording file in the directory to be cleaned, the sound recording text corresponding to the cleaned sound recording file, and a first application field, wherein the first application field is determined by a processing person when cleaning the sound recording file;
the second storage module specifically includes:
the domain judging unit is used for judging whether the first application domain of the sound recording file is consistent with the initially determined application domain;
the first storage unit is used for storing the cleaned sound recording file and the corresponding sound recording text in a position to which the initially determined application field belongs in the model training set in an associated manner if the judgment result of the field judgment unit is yes;
and the second storage unit is used for storing the cleaned sound recording file and the corresponding sound recording text in a position to which the first application field belongs in the model training set in an associated manner if the judgment result of the field judgment unit is negative.
6. The apparatus for processing audio file according to claim 5, wherein the text determining module comprises:
the word error rate calculation unit is used for calculating the word error rate of the recognition text relative to the original text;
the word error rate judging unit is used for judging whether the word error rate is 0 or not;
a first determining unit, configured to determine that the recognition text is consistent with the original text if a determination result of the wrong word rate determining unit is yes;
and the second determining unit is used for determining that the recognition text is inconsistent with the original text if the judgment result of the misword rate judging unit is negative.
7. Computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the sound recording file processing method according to any of claims 1 to 4 when executing the computer program.
8. A computer-readable storage medium, in which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the sound recording file processing method according to any one of claims 1 to 4.
CN201810735639.2A 2018-07-06 2018-07-06 Recording file processing method and device, computer equipment and storage medium Active CN109101484B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810735639.2A CN109101484B (en) 2018-07-06 2018-07-06 Recording file processing method and device, computer equipment and storage medium
PCT/CN2018/106259 WO2020006879A1 (en) 2018-07-06 2018-09-18 Recording file processing method and apparatus, and computer device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810735639.2A CN109101484B (en) 2018-07-06 2018-07-06 Recording file processing method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN109101484A CN109101484A (en) 2018-12-28
CN109101484B true CN109101484B (en) 2023-04-18

Family

ID=64845566

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810735639.2A Active CN109101484B (en) 2018-07-06 2018-07-06 Recording file processing method and device, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN109101484B (en)
WO (1) WO2020006879A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110310626A (en) * 2019-05-23 2019-10-08 平安科技(深圳)有限公司 Voice training data creation method, device, equipment and readable storage medium
CN112509608B (en) * 2020-11-25 2022-03-08 广州朗国电子科技股份有限公司 Method and device for recording sound along with channel of USB (Universal Serial bus) equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101154379A (en) * 2006-09-27 2008-04-02 夏普株式会社 Method and device for locating keywords in voice and voice recognition system
CN102867511A (en) * 2011-07-04 2013-01-09 余喆 Method and device for recognizing natural speech
CN104636475A (en) * 2015-02-13 2015-05-20 小米科技有限责任公司 Method and device for optimizing multimedia file storage space
CN106875464A (en) * 2017-01-11 2017-06-20 深圳云创享网络有限公司 Threedimensional model document handling method, method for uploading and client
CN207380829U (en) * 2017-10-26 2018-05-18 福建星联科技有限公司 A kind of intelligent medicine storage device based on PLC and speech recognition

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8380510B2 (en) * 2005-05-20 2013-02-19 Nuance Communications, Inc. System and method for multi level transcript quality checking
CN102522084B (en) * 2011-12-22 2013-09-18 广东威创视讯科技股份有限公司 Method and system for converting voice data into text files
CN102737634A (en) * 2012-05-29 2012-10-17 百度在线网络技术(北京)有限公司 Authentication method and device based on voice
CN107578769B (en) * 2016-07-04 2021-03-23 科大讯飞股份有限公司 Voice data labeling method and device
CN106228980B (en) * 2016-07-21 2019-07-05 百度在线网络技术(北京)有限公司 Data processing method and device


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Xie Jinhui. Continuous speech recognition using context-dependent phoneme HMMs. Journal on Communications. 1994, 83-87. *

Also Published As

Publication number Publication date
CN109101484A (en) 2018-12-28
WO2020006879A1 (en) 2020-01-09

Similar Documents

Publication Publication Date Title
CN108121795B (en) User behavior prediction method and device
US20210224832A1 (en) Method and apparatus for predicting customer purchase intention, electronic device and medium
CN112492111B (en) Intelligent voice outbound method, device, computer equipment and storage medium
US20140244249A1 (en) System and Method for Identification of Intent Segment(s) in Caller-Agent Conversations
CN108682420B (en) Audio and video call dialect recognition method and terminal equipment
US10140285B2 (en) System and method for generating phrase based categories of interactions
CN112037799B (en) Voice interrupt processing method and device, computer equipment and storage medium
US11062706B2 (en) System and method for speaker role determination and scrubbing identifying information
US20210065203A1 (en) Machine-learning based systems and methods for generating an ordered listing of objects for a particular user
CN109101484B (en) Recording file processing method and device, computer equipment and storage medium
US20200410371A1 (en) Data analysis method and device
CN109743589B (en) Article generation method and device
CN109672936B (en) Method and device for determining video evaluation set and electronic equipment
US11768961B2 (en) System and method for speaker role determination and scrubbing identifying information
US10776419B2 (en) Audio file quality and accuracy assessment
WO2020257991A1 (en) User identification method and related product
CN109493869A (en) The acquisition method and system of audio data
US20150363801A1 (en) Apparatus and method for predicting the behavior or state of a negative occurrence class
RU2020117547A (en) VEHICLE OPERATION ASSISTANCE DEVICE
CN110890088A (en) Voice information feedback method and device, computer equipment and storage medium
CN110602207A (en) Method, device, server and storage medium for predicting push information based on off-network
CN110309295B (en) Method and device for generating examined and found sections of referee document
CN111159169B (en) Data management method and equipment
CN108304310B (en) Log analysis method and computing device
CN110933504A (en) Video recommendation method, device, server and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant