CN112559798A - Method and device for detecting quality of audio content

Info

Publication number
CN112559798A
Authority
CN
China
Prior art keywords
sentence
target
sentences
trigger
target audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910922694.7A
Other languages
Chinese (zh)
Other versions
CN112559798B (en)
Inventor
陈佳豪
丁文彪
刘子韬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xintang Sichuang Educational Technology Co Ltd
Original Assignee
Beijing Xintang Sichuang Educational Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xintang Sichuang Educational Technology Co Ltd filed Critical Beijing Xintang Sichuang Educational Technology Co Ltd
Priority to CN201910922694.7A priority Critical patent/CN112559798B/en
Priority to PCT/CN2020/107066 priority patent/WO2021057270A1/en
Publication of CN112559798A publication Critical patent/CN112559798A/en
Application granted granted Critical
Publication of CN112559798B publication Critical patent/CN112559798B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F16/683: Information retrieval of audio data; retrieval characterised by using metadata automatically derived from the content
    • G06F16/638: Information retrieval of audio data; querying; presentation of query results
    • G06F18/00: Pattern recognition
    • G06F18/22: Pattern recognition; analysing; matching criteria, e.g. proximity measures
    • G06F40/211: Handling natural language data; natural language analysis; syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/289: Handling natural language data; recognition of textual entities; phrasal analysis, e.g. finite state techniques or chunking
    • G06Q10/06: Administration; management; resources, workflows, human or project management; enterprise or organisation planning or modelling
    • G06Q10/06395: Operations research, analysis or management; performance analysis; quality analysis or management
    • G06Q50/26: Systems or methods specially adapted for specific business sectors; services; government or public services
    • G06Q50/265: Government or public services; personal security, identity or safety

Abstract

Embodiments of the invention provide a method and a device for detecting the quality of audio content. The method comprises the following steps: acquiring, according to a target sentence set, a trigger sentence in target audio that matches the target sentence set; extracting the segment of the target audio in which the trigger sentence is located; adding the segment in which the trigger sentence is located to a result set of the target audio; and determining the quality of the target audio content according to the segments in the result set. The method and device provided by the embodiments of the invention can automatically score the target audio, reduce labor cost, and improve the accuracy of quality judgment.

Description

Method and device for detecting quality of audio content
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for detecting the quality of audio content.
Background
With the development of science and technology, many new teaching modes have begun to appear. Online one-to-one teaching has attracted wide attention because the interaction between teachers and students is freed from regional limitations. However, the teaching quality of online teaching is more difficult to measure.
In a teaching scenario, many teaching activities can improve teaching quality, for example, a teacher asking students to explain the questions they answered incorrectly. In online teaching, it is difficult to detect whether a teacher performs such quality-improving teaching operations during a lesson. In the prior art, either a questionnaire is issued to students after a lecture, or the recorded teaching video is checked manually. The former method is subjective and takes up the students' time; the latter has an extremely large time cost, and as the number of courses increases, the labor cost rises sharply.
Disclosure of Invention
The embodiment of the invention provides a method and a device for detecting the quality of audio content, which are used for solving one or more technical problems in the prior art.
In a first aspect, an embodiment of the present invention provides an audio content quality detection method, including:
acquiring, according to a target sentence set, a trigger sentence in target audio that matches the target sentence set;
extracting the segment of the target audio in which the trigger sentence is located;
adding the segment in which the trigger sentence is located to a result set of the target audio;
and determining the quality of the target audio content according to the segments in the result set.
In one embodiment, before the step of obtaining a trigger sentence in the target audio that matches the target sentence set, the method further comprises:
searching the corpus for a second sentence whose similarity to a first sentence in a first sentence set reaches a first set threshold;
if the second sentence does not exist in the first sentence set, taking the second sentence as a new first sentence, adding the new first sentence to the first sentence set, and returning to the step of searching the corpus for a second sentence whose similarity to a first sentence in the first sentence set reaches the first set threshold; and if the first sentence set already comprises the second sentence, constructing a target sentence set according to the first sentence set, wherein the sentences in the target sentence set are target sentences.
In one embodiment, the searching, in the corpus, for a second sentence having a similarity to a first sentence in the first sentence set reaching a first set threshold includes:
performing word segmentation processing on the first sentence to obtain a first vector representation of the first sentence; performing word segmentation processing on each sentence in the corpus to obtain a second vector representation of each sentence;
calculating the similarity of the first sentence and each sentence in the corpus according to the first vector representation and the second vector representation;
and searching the second sentence according to the similarity and the first set threshold.
In one embodiment, after the constructing the target sentence set, the method further includes:
acquiring, according to the target sentence set, a pattern description representing the common characteristics of the target sentences in the target sentence set;
obtaining a regular expression of the target sentence set according to the pattern description;
the obtaining of the trigger sentence matched with the target sentence set in the target audio includes:
and acquiring a trigger sentence matched with the regular expression in the target audio.
In one embodiment, obtaining a regular expression of the target sentence set according to the pattern description includes:
obtaining a candidate regular expression by using the pattern description;
selecting a candidate set from a detection sample according to the candidate regular expression, wherein the detection sample comprises a plurality of sample sentences;
and screening the candidate regular expressions according to the similarity between the sample sentences in the candidate set and the target sentences to obtain the regular expressions.
In one embodiment, extracting a segment of the target audio where the trigger sentence is located includes:
taking the trigger sentence as a starting point and sequentially extracting a first set number of sentences in time order as the corresponding segment of the trigger sentence in the target audio;
or, taking the trigger sentence as a starting point, taking the first subsequent pause longer than a set duration as an end point, and extracting the sentences between the starting point and the end point as the corresponding segment of the trigger sentence in the target audio;
or, where more than one trigger sentence exists and the number of trigger sentences does not exceed a second set number, taking the trigger sentence earliest in time order as a starting point, extracting a first set number of sentences in time order after the trigger sentence latest in time order, and combining them with the sentences between those trigger sentences to form the corresponding segment of the trigger sentences in the target audio, wherein the number of sentences between two adjacent trigger sentences is smaller than the first set number.
In one embodiment, determining the target audio content quality from segments in the result set comprises:
extracting statistical characteristics of the segments;
calculating the score of each segment according to weights assigned in advance to the statistical characteristics of the segment;
and determining the quality of the target audio content according to the scores of all the segments in the result set.
In a second aspect, an embodiment of the present invention further provides an apparatus for detecting quality of audio content, including:
a trigger sentence detection module: configured to acquire, according to a target sentence set, a trigger sentence in target audio that matches the target sentence set;
a segment extraction module: configured to extract the segment of the target audio in which the trigger sentence is located;
a result set generation module: configured to add the segment in which the trigger sentence is located to a result set of the target audio;
a quality determination module: configured to determine the quality of the target audio content according to the segments in the result set.
In one embodiment, the apparatus further comprises:
a second sentence searching module: configured to search a corpus for a second sentence whose similarity to a first sentence in a first sentence set reaches a first set threshold;
a target sentence adding module: configured to, if the second sentence does not exist in the first sentence set, take the second sentence as a new first sentence, add the new first sentence to the first sentence set, and trigger the second sentence searching module again;
a target sentence set construction module: configured to, if the first sentence set already comprises the second sentence, construct a target sentence set according to the first sentence set, wherein the sentences in the target sentence set are target sentences.
In one embodiment, the second sentence lookup module comprises:
a first vector unit: configured to perform word segmentation processing on the first sentence to obtain a first vector representation of the first sentence;
a second vector unit: configured to perform word segmentation processing on each sentence in the corpus to obtain a second vector representation of each sentence;
a similarity calculation unit: configured to calculate the similarity between the first sentence and each sentence in the corpus according to the first vector representation and the second vector representation;
a similarity processing unit: configured to search for the second sentence according to the similarity and the first set threshold.
In one embodiment, the apparatus further comprises:
a pattern extraction module: configured to acquire, according to the target sentence set, a pattern description representing the common characteristics of the target sentences in the target sentence set;
a regular expression obtaining module: configured to obtain a regular expression of the target sentence set according to the pattern description;
the trigger sentence detection module is further configured to:
and acquiring a trigger sentence matched with the regular expression in the target audio.
In one embodiment, the regular expression acquisition module comprises:
a candidate regular expression acquisition unit: configured to obtain a candidate regular expression by using the pattern description;
a candidate set detection unit: configured to select a candidate set from a detection sample according to the candidate regular expression, wherein the detection sample comprises a plurality of sample sentences;
a screening unit: configured to screen the candidate regular expressions according to the similarity between the sample sentences in the candidate set and the target sentences to obtain the regular expression.
In one embodiment, the segment extraction module comprises:
a first extraction unit: configured to take the trigger sentence as a starting point and sequentially extract a first set number of sentences in time order as the corresponding segment of the trigger sentence in the target audio;
or, a second extraction unit: configured to take the trigger sentence as a starting point, take the first subsequent pause longer than a set duration as an end point, and extract the sentences between the starting point and the end point as the corresponding segment of the trigger sentence in the target audio;
or, a third extraction unit: configured to, where more than one trigger sentence exists and the number of trigger sentences does not exceed a second set number, take the trigger sentence earliest in time order as a starting point, extract a first set number of sentences in time order after the trigger sentence latest in time order, and combine them with the sentences between those trigger sentences to form the corresponding segment of the trigger sentences in the target audio, wherein the number of sentences between two adjacent trigger sentences is smaller than the first set number.
In one embodiment, the quality determination module comprises:
a feature extraction unit: configured to extract statistical characteristics of the segments;
a score calculation unit: configured to calculate the score of each segment according to weights assigned in advance to the statistical characteristics of the segment;
a quality unit: configured to determine the quality of the target audio content according to the scores of the segments in the result set.
In a third aspect, an embodiment of the present invention provides an apparatus for detecting quality of audio content, where the function of the apparatus may be implemented by hardware, or may be implemented by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above-described functions.
In one possible design, the apparatus includes a processor and a memory, the memory is used for storing a program supporting the apparatus to execute the above-mentioned audio content quality detection method, and the processor is configured to execute the program stored in the memory. The apparatus may also include a communication interface for communicating with other devices or a communication network.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium for storing computer software instructions for an audio content quality detection apparatus, which includes a program for executing the audio content quality detection method.
One of the above technical solutions has the following advantages or beneficial effects: embodiments of the invention can identify trigger sentences in audio, extract the corresponding segments according to the trigger sentences for scoring, and evaluate the quality of the audio content, so that a large amount of audio can be processed automatically, labor cost is reduced, and quality detection accuracy is high.
The foregoing summary is provided for the purpose of description only and is not intended to be limiting in any way. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features of the present invention will be readily apparent by reference to the drawings and following detailed description.
Drawings
In the drawings, like reference numerals refer to the same or similar parts or elements throughout the several views unless otherwise specified. The figures are not necessarily to scale. It is appreciated that these drawings depict only some embodiments in accordance with the disclosure and are therefore not to be considered limiting of its scope.
Fig. 1 shows a flow chart of an audio content quality detection method according to an embodiment of the present invention.
Fig. 2 shows a flow chart of an audio content quality detection method according to an embodiment of the present invention.
Fig. 3 shows a partial flow diagram of an audio content quality detection method according to an embodiment of the invention.
Fig. 4 shows a partial flow diagram of an audio content quality detection method according to an embodiment of the invention.
Fig. 5 shows a block diagram of the structure of an audio content quality detection apparatus according to an embodiment of the present invention.
Fig. 6 shows a block diagram of the structure of an audio content quality detection apparatus according to an embodiment of the present invention.
Fig. 7 shows a block diagram of the structure of an audio content quality detection apparatus according to an embodiment of the present invention.
Detailed Description
In the following, only certain exemplary embodiments are briefly described. As those skilled in the art will recognize, the described embodiments may be modified in various different ways, all without departing from the spirit or scope of the present invention. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.
Fig. 1 shows a flow chart of an audio content quality detection method according to an embodiment of the present invention. As shown in fig. 1, the audio content quality detection method includes:
step S11: and acquiring a trigger sentence matched with the target sentence set in the target audio according to the target sentence set.
Step S12: and extracting a corresponding segment of the trigger sentence in the target audio.
Step S13: and adding the segment where the trigger sentence is located into the result set of the target audio.
Step S14: and determining the target audio content quality according to the segments in the result set.
In the embodiment of the present invention, the target sentence set may include a plurality of sentences, the target audio is the audio whose content quality is to be detected, and audio content quality detection refers to detecting whether the target audio contains preset sentence content (or preset audio content). A trigger sentence matched with the target sentence set may be a sentence in the target audio that corresponds to any sentence in the target sentence set. The audio content quality detection method provided by this embodiment can be applied to teaching scenarios. The target audio may include audio obtained by recording the teaching process, or audio separated from a teaching video. Besides teaching scenarios, the audio content quality detection method provided by the embodiment of the invention can also be used in other scenarios in which the quality of audio content needs to be detected.
The trigger sentence in the embodiment of the present invention may be a sentence that is the same as or has a higher similarity to the sentences in the target sentence set.
In the embodiment of the present invention, the segment in which the trigger sentence is located may be a part of the target audio, for example, a part that includes the audio corresponding to the trigger sentence. The result set may be the set of all segments that contain a trigger sentence. Determining the quality of the target audio content according to the segments in the result set may mean scoring or grading the segments in the result set and determining the quality of the target audio content according to the scoring or grading result.
For example, the trigger sentence is "please say the meaning of this paragraph"; the corresponding segment of the target audio includes the trigger sentence and the communication between the teacher and the student about the trigger sentence. With a full score of 1, the segment is scored 0.8, and the quality of the target audio content is evaluated as excellent or good according to the scoring result.
In some embodiments of the present invention, determining the target audio content quality may be directly determining the target audio content quality according to the segment corresponding to the trigger sentence, and does not need to determine the target audio content quality by scoring.
In some embodiments, the target sentence is a reference point for detecting the quality of the target audio content, e.g., the detector wants to know whether the content related to the target sentence exists in the target audio. In practical applications, there may be many segments related to the target sentence in one target audio, and there may be no content related to the target sentence in the whole target audio. Meanwhile, due to the diversity of language expression modes, one target sentence may have multiple expression modes, and the multiple expression modes may be included in the target sentence set, so that multiple trigger sentences related to at least one sentence in the target sentence set may exist in one section of target audio, and multiple segments corresponding to the target sentence set may also exist correspondingly.
For example, in a teaching scene, the target sentence set includes { please say a wrong question, please say a place where an error occurs, please say where an error occurs, and say a place where an error occurs, }, a first trigger sentence "tells a place where an error occurs" exists in the target audio, a second trigger sentence "where you have an error" exists, and a third trigger sentence "says a place where an error occurs", considering that there may be repeated utterances in an actual conversation, there are three trigger sentences related to the target sentence set in the target audio, and then there may be three segments at most in the target audio, and these three segments are added to the result set of the target audio as a basis for detecting the content quality of the target audio.
According to the embodiment of the invention, the trigger sentence is searched in the target audio according to the target sentence set, and the segment corresponding to the trigger sentence is screened out, so that the content quality of the target audio is analyzed according to the segment, the analysis range is narrowed, the labor cost is reduced, the number of detections is increased, and the efficiency of audio quality analysis is improved; in addition, the accuracy and objectivity of evaluation are guaranteed through automatic detection of the quality of the audio content.
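To make the overall flow of Fig. 1 concrete, the following is a minimal Python sketch of steps S11-S14; the transcript representation and the helper functions (find_trigger_sentences, extract_segment, score_segment) are illustrative assumptions rather than part of the claimed method.

```python
# Minimal sketch of the flow in Fig. 1 (steps S11-S14); all helper functions and the
# transcript format are illustrative assumptions.

def detect_audio_content_quality(transcript_sentences, target_sentence_set,
                                 find_trigger_sentences, extract_segment, score_segment):
    """transcript_sentences: sentences transcribed from the target audio, in time order."""
    # Step S11: acquire trigger sentences in the target audio that match the target sentence set.
    triggers = find_trigger_sentences(transcript_sentences, target_sentence_set)

    # Steps S12 and S13: extract the segment where each trigger sentence is located
    # and add it to the result set of the target audio.
    result_set = [extract_segment(transcript_sentences, t) for t in triggers]

    # Step S14: determine the quality of the target audio content from the segments,
    # here simply as the average of the per-segment scores.
    scores = [score_segment(segment) for segment in result_set]
    return sum(scores) / len(scores) if scores else 0.0
```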
In one embodiment, before the step of obtaining the trigger sentence in the target audio, which matches the target sentence set, as shown in fig. 2, the method further includes:
step S21: and searching a second sentence with the similarity reaching a first set threshold value with the first sentence in the first sentence set in the corpus.
Step S22: if the second sentence does not exist in the first sentence set, the second sentence is taken as a new first sentence and added to the first sentence set, and the process returns to step S21 to repeat, in the corpus, the search for a second sentence whose similarity to a first sentence in the first sentence set (now including the new first sentence) reaches the first set threshold.
Step S23: and if the first sentence set comprises the second sentence, constructing a target sentence set according to the first sentence set, wherein sentences in the target sentence set are target sentences.
In an embodiment of the present invention, when returning to step S21 to repeat the search in the corpus, step S22 may further include: searching the corpus only for a second sentence whose similarity to the new first sentence in the first sentence set reaches a second set threshold; then determining whether that second sentence is already present in the first sentence set, and performing steps S22 and S23 until duplicates have been removed and no new second sentence is added to the first sentence set.
In the embodiment of the invention, the corpus is constructed according to the collected text corpora. The text corpus can be specifically acquired according to teaching videos or teaching audios except the target audio.
In the embodiment of the invention, the method is applied to a teaching scene. The text corpus can be collected from previously stored teaching videos or teaching audios. The collected text corpus mainly comprises teaching voice and the like in a teaching scene.
By constructing the target sentence set, sentences that are similar to a target sentence can be added to the target sentence set as further target sentences. Because the same meaning can be expressed in multiple ways in language, constructing the target sentence set and using it as the reference for extracting segments from the target audio in the embodiment of the invention improves the accuracy of paragraph extraction as well as the accuracy of detecting the related segments.
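The iterative construction of the target sentence set (steps S21-S23) can be sketched as follows; this is a minimal illustration that assumes a sentence-similarity helper such as the word-vector cosine similarity described in the next embodiment, and a threshold chosen only for illustration.

```python
# Illustrative sketch of steps S21-S23: iteratively expanding a first sentence set into
# the target sentence set. `similarity` is an assumed sentence-similarity function.

def build_target_sentence_set(first_sentences, corpus, similarity, threshold=0.9):
    target_set = set(first_sentences)
    added = True
    while added:                                  # repeat until no new second sentence is found
        added = False
        for first in list(target_set):
            for candidate in corpus:
                # Step S21: search for a second sentence whose similarity to a first
                # sentence reaches the first set threshold.
                if candidate not in target_set and similarity(first, candidate) >= threshold:
                    # Step S22: take it as a new first sentence and search again.
                    target_set.add(candidate)
                    added = True
    # Step S23: the accumulated set is the target sentence set.
    return target_set
```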
In one embodiment, the searching, in the corpus, for a second sentence having a similarity to a first sentence in the first sentence set reaching a first set threshold includes:
performing word segmentation processing on the first sentence to obtain a first vector representation of the first sentence; performing word segmentation processing on each sentence in the corpus to obtain a second vector representation of each sentence;
calculating the similarity of the first sentence and each sentence in the corpus according to the first vector representation and the second vector representation;
and searching the second sentence in the corpus according to the similarity and the first set threshold value.
In a specific example, performing word segmentation on the first sentence to obtain a first vector representation of the first sentence; and performing word segmentation processing on each sentence in the corpus to obtain a second vector representation of each sentence, including:
performing word segmentation processing on the first sentence to obtain a first word vector of each word of the first sentence;
averaging vectors of each first word vector to obtain first vector representation of the first sentence;
similarly, word segmentation processing is carried out on each sentence in the corpus to obtain a second word vector of each word of the corresponding sentence;
and averaging the second word vectors to obtain a second vector representation of each sentence in the corpus.
For example, the first sentence is "the weather is good today", and the corpus contains a sentence that expresses the same meaning in a very similar way. The similarity between the two sentences, calculated as described above, reaches the first set threshold, so the corpus sentence is a second sentence, and the second sentence and the first sentence are both added to the target sentence set. When acquiring a trigger sentence in the target audio, trigger sentences are detected in the target audio according to each target sentence in the target sentence set in turn; if a sentence matching any target sentence in the target sentence set exists in the target audio, that sentence is determined to be a trigger sentence.
In one embodiment, calculating a similarity between the target sentence and each sentence in the corpus based on the first vector representation and the second vector representation comprises:
calculating the similarity between the first vector representation and the second vector representation by using the following formula to obtain the similarity between the first sentence and each sentence in the corpus:

S = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \cdot \sqrt{\sum_{i=1}^{n} B_i^2}}

wherein S is the similarity between A and B, A is the first vector representation, B is the second vector representation, n is the number of word segments, i denotes the i-th word segment, A_i is the i-th word vector of the target sentence, and B_i is the i-th word vector of the sentence in the corpus.
When comparing the target sentence with each sentence in the corpus, the similarity can be compared for each sentence in the corpus by adopting the method.
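A minimal Python sketch of this similarity computation is given below. The word-segmentation function and the word-vector table (for example, a Chinese tokenizer plus pretrained embeddings) are assumptions made for illustration; the embodiment does not prescribe a particular tokenizer or embedding.

```python
import numpy as np

# Sketch of the sentence-vector and cosine-similarity computation described above.
# `segment_words` (word segmentation) and `word_vectors` (word -> embedding mapping)
# are assumed to be supplied by the caller.

def sentence_vector(sentence, segment_words, word_vectors, dim=300):
    words = segment_words(sentence)                          # word segmentation
    vectors = [word_vectors[w] for w in words if w in word_vectors]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)                          # average the word vectors

def cosine_similarity(vec_a, vec_b):
    # S = (A . B) / (|A| * |B|), as in the formula above.
    denom = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
    return float(np.dot(vec_a, vec_b) / denom) if denom else 0.0
```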
In one embodiment, after the constructing the target sentence set, the method further comprises:
acquiring mode description representing the common characteristics of the target sentences in the target sentence set according to the target sentence set;
obtaining a regular expression of the target sentence set according to the pattern description;
the obtaining of the trigger sentence matched with the target sentence set in the target audio includes:
and acquiring a trigger sentence matched with the regular expression in the target audio.
The mode description of the common characteristics in the embodiment of the invention can be the common characteristics of a class of sentences with similar or identical meanings, and the same meaning can have various forms in the aspect of language expression. For example, "please say what the question is wrong", "please say the reason why the question is wrong", "what the reason why the question is wrong", and "what the question is wrong". The common characteristic of the target sentence is extracted, other sentences similar to the target sentence can be searched by utilizing the common characteristic, and therefore the trigger sentence can be accurately searched in the target audio.
In the embodiment of the present invention, the mode description of the target sentence may be a sentence construction mode of the target sentence, including a keyword and other factors. Regular expressions, also known as regular expressions, are often used to retrieve and replace text that conforms to a pattern or rule.
In one example of the present invention, the process from building a set of target sentences from a first sentence to extracting a schema description of the target sentences includes the steps as shown in FIG. 3:
step S31: and performing word segmentation processing on the first sentence to obtain a first word vector of each word of the first sentence. For example, the present example is applied to a teaching scenario, and the first sentence is "analyze the reason for this wrong question". And then, segmenting each sentence, and averagely obtaining the vector representation of the sentence by the word vector of each word. As shown in fig. 4, in one possible example, the first sentence "analyze the reason for the wrong question at once" is subject to the word segmentation process, resulting in the words "analyze", "at once", "this", "wrong question", "of", "reason". A first word vector for each word is further obtained, e.g., assuming that the first word vector for each word is [0.2, 0.5, 0.4.
Step S32: and averaging the first word vectors to obtain a first vector representation of the first sentence. On the basis of the assumptions of fig. 4, the first word vector is averaged, resulting in a first vector representation of the first sentence as [0.2, 0.5, 0.4.
Step S33: a corpus is obtained.
Step S34: and performing word segmentation processing on each sentence in the corpus to obtain a second word vector of each word in the corpus.
Step S35: and averaging the second word vectors to obtain a second vector representation of each sentence in the corpus.
Step S36: and calculating the similarity of the target sentence and each sentence in the corpus according to the first vector representation and the second vector representation. In one example, assuming that the first vector is denoted as a and the second vector is denoted as B, the similarity may be cosine values of the two vectors. Namely:
Figure BDA0002217106150000111
s is the similarity, and theta is the included angle of the first vector and the second vector.
Step S37: and adding sentences the similarity of which reaches a first set threshold into a preselected set. The first set threshold may be a fraction close to 1, for example 0.9.
Step S38: the steps S31-S37 are repeated for each sentence in the preselected set as the first sentence until the number of sentences in the preselected set no longer increases. In performing step S38, not only are sentences added to the preselected set, but also duplicate sentences in the preselected set are removed. For example, the pre-selected set is finally obtained as { analyze the reason of the wrong question, analyze the reason of the wrong you, what the reason of the wrong you is, why the wrong you were analyzed }.
Step S39: and extracting common features from the sentences in the preselected set to obtain the mode description of the target sentence. And further following the pattern description of the sentence to obtain a regular expression.
In the embodiment of the invention, the regular expression contains the characteristics of sentences, is an inductive summary of the characteristics of a class of sentences expressing the same or similar sentences, and can screen out trigger sentences with the same or similar meanings from the target audio according to the regular expression, thereby realizing the automatic acquisition of audio segments and facilitating the subsequent scoring and evaluation.
In one embodiment, obtaining a regular expression of the target sentence set according to the pattern description includes:
obtaining a candidate regular expression by using the pattern description;
selecting a candidate set from a detection sample according to the candidate regular expression, wherein the detection sample comprises a plurality of sample sentences;
and screening the candidate regular expressions according to the similarity between the sample sentences in the candidate set and the target sentences to obtain the regular expressions.
In an embodiment of the present invention, obtaining the candidate regular expression by using the pattern description of the target sentence may include taking the pattern description of the target sentence as the candidate regular expression.
In the embodiment of the present invention, the detection sample may be obtained from a large amount of teaching audio data or video data other than the target audio.
In one embodiment of the present invention, the plurality of sample sentences in the detection sample may be sample sentences carrying similarity labels. A similarity label represents the similarity between the sample sentence and a target sentence in the target sentence set, and the labels may be annotated in advance by manual labeling. Screening the candidate regular expressions according to the similarity between the sample sentences in the candidate set and the target sentences may then be: obtaining the similarity between the sample sentences in the candidate set and the target sentences from their similarity labels, and screening the candidate regular expressions by removing the sample sentences in the candidate set whose similarity is lower than the first set threshold together with the candidate expressions corresponding to those sample sentences, so as to obtain the regular expression; or by keeping the expressions corresponding to the sample sentences that remain after the sample sentences whose similarity is lower than the first set threshold are removed from the candidate set, so as to obtain the regular expression.
In an embodiment of the present invention, the screening of the candidate regular expressions according to the similarity between the sample sentence and the target sentence in the candidate set may further be: and according to the calculation formula of the similarity S in the step S36, calculating the similarity between the sample sentence and the target sentence in the candidate set, and screening the candidate regular expressions. The screening process is similar to the description of the previous embodiment, and is not repeated herein.
In an embodiment of the present invention, the mode description or the candidate regular expression may be further filtered through a preset first model to obtain the regular expression.
Specifically, in one example, the training process of the first model includes: and taking the pattern description of the target sentence or a candidate regular expression obtained based on the pattern description as the input of the first model, taking the label of the pattern description or the label of each expression in the candidate regular expression as the output of the first model, and training the first model.
It should be noted that the labels of the pattern descriptions represent whether each pattern description can accurately express the common features of the target sentence, and the labels of each expression in the candidate regular expressions also represent whether each expression can accurately express the common features of the target sentence. The label of the expression is 1, which indicates that the expression can accurately express the common characteristics of the target sentence.
In practical use, the pattern description or the candidate regular expression can be screened through the first model to obtain the regular expression.
In one example, the training process of the first model may also be: in the mode description of a target sentence or a candidate regular expression obtained based on the mode description and a detection sample (such as sample audio), a candidate sample set obtained after the detection of the candidate regular expression is used as the input of a first model, a sample sentence set with a label of 1 in the candidate sample set is used as the output, and the first model is trained.
And the label of the sample sentence represents whether the similarity between the sample sentence and the target sentence meets a preset value, if so, the label is 1, otherwise, the label is 0.
In practical use, the candidate set obtained based on the pattern description or the candidate regular expression can be further filtered through the first model to obtain a training set, and then the pattern description or the candidate regular expression is screened based on the training set to obtain the regular expression. The screening process is similar to the description of the previous embodiment, and is not repeated herein.
Specifically, in one example, the processing of the data by the first model includes the steps shown in fig. 4:
step S41: and obtaining a candidate regular expression by using the pattern description of the target sentence. For example, in one example, the candidate regular expressions are: the cause of the error [ \ w ], |? L. {0, 1} [ \\ w ] -, and "error [ \ w ] -, |? L. {0, 1} [ \\ w ]. why }. "\ w" represents the replacement of any Chinese character and "|" represents the logical OR. "0, 1" means that it can occur 1 or 0 times. "book" means a logical and.
Step S42: and screening the detection samples such as texts of sample audios by using the candidate regular expressions to obtain a candidate set.
Step S43: and judging whether the similarity between the sentences in the candidate set and the first sentence meets a preset value or not, and obtaining a judgment result. The judgment can be carried out in a semantic comparison mode.
Step S44: and screening the candidate regular expressions according to the judgment result, and removing the candidate regular expressions with low accuracy and sentences selected by the candidate regular expressions with low accuracy to obtain a training set.
Step S45: and screening the candidate regular expressions based on the training set to obtain the regular expressions.
In an embodiment of the present invention, after the regular expression is obtained, a training set may be obtained based on the obtained regular expression. Specifically, for example, a sentence conforming to a regular expression is obtained in one sample by using the regular expression.
In another embodiment of the invention, the sentences which are obtained in the sample by using the regular expression and conform to the regular expression can be checked, and whether the meanings of the sentences are consistent with the meanings of the target sentences to be expressed or not can be judged.
Step S451: extracting text features from sentences in the training set, wherein the extracted text features comprise: bag of words (Bag of Word), Term Frequency-Inverse Document Frequency (TF-IDF, Term Frequency-Inverse Document Frequency), and Term vector. The three text features are vectors, for example, the bag-of-words vector is [0, 1, 2 … … 0 ]. The TF-IDF vector is [0.4, 0.4, 0.2 … … 0.5.5 ], and the term vector is [0.4, 0.4, 0.2 … … 0.5.5 ]. And (3) performing dimensionality reduction on the vector needing dimensionality reduction, for example, performing PCA (Principal Component Analysis) dimensionality reduction on the bag-of-word vector when the dimension of the bag-of-word vector is higher to obtain a vector [0.4, 0.4, 0.2 … … 0.5.5 ]. And obtaining the feature combination of the training set according to the three text features.
Step S452: the GBDT (Gradient Boosting Decision Tree) is trained using feature combinations. And inputting the feature combination of the training set into the GBDT, and training the GBDT by using the feature combination of the training set screened by the obtained accurate regular expression, so that the GBDT can obtain texts corresponding to the target audio according to the regular expression for filtering, and further obtain trigger sentences in the texts.
The specific GBDT training process may be: and inputting the feature combination and the corresponding label into the GBDT model, and learning by the GBDT model according to the comparison condition of the feature combination judgment result and the label. The tag may specifically be "yes", that is, the sentence corresponding to the corresponding feature combination conforms to the regular expression, and the expressed meaning is consistent with or similar to the target sentence.
In the prediction phase, the GBDT model may still determine, according to the feature combination of the input text, whether a sentence corresponding to the feature combination conforms to the regular expression, thereby outputting "yes" or "no" to indicate whether the corresponding sentence is a wake-up sentence.
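As one possible concrete realisation of steps S451-S452, the sketch below uses scikit-learn to build bag-of-words and TF-IDF features, reduce the bag-of-words dimension with PCA, and train a GBDT classifier; the library choice, the feature dimensions and the omission of the word-vector feature are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier

# Sketch of steps S451-S452: feature combination plus GBDT training.
# Labels are 1 ("yes") if the sentence conforms to the regular expression and
# expresses the target meaning, otherwise 0.

def train_trigger_classifier(train_sentences, labels, n_components=50):
    bow = CountVectorizer()
    tfidf = TfidfVectorizer()
    bow_x = bow.fit_transform(train_sentences).toarray()
    tfidf_x = tfidf.fit_transform(train_sentences).toarray()

    # Reduce the high-dimensional bag-of-words vectors with PCA (step S451).
    n_components = min(n_components, bow_x.shape[0], bow_x.shape[1])
    pca = PCA(n_components=n_components)
    bow_reduced = pca.fit_transform(bow_x)

    features = np.hstack([bow_reduced, tfidf_x])     # feature combination
    gbdt = GradientBoostingClassifier()
    gbdt.fit(features, labels)                       # step S452: train the GBDT
    return gbdt, (bow, tfidf, pca)
```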
In one embodiment, extracting a segment of the target audio where the trigger sentence is located includes:
taking the trigger sentence as a starting point and sequentially extracting a first set number of sentences in time order as the corresponding segment of the trigger sentence in the target audio;
or, taking the trigger sentence as a starting point, taking the first subsequent pause longer than a set duration as an end point, and extracting the sentences between the starting point and the end point as the corresponding segment of the trigger sentence in the target audio;
or, where more than one trigger sentence exists and the number of trigger sentences does not exceed a second set number, taking the trigger sentence earliest in time order as a starting point, extracting a first set number of sentences in time order after the trigger sentence latest in time order, and combining them with the sentences between those trigger sentences to form the corresponding segment of the trigger sentences in the target audio, wherein the number of sentences between two adjacent trigger sentences is smaller than the first set number.
In one example, each trigger sentence is used as a starting point; among the 30 sentences after the time point of the trigger sentence, if a sentence is followed by a pause of more than 30 s, that sentence is marked as the end point, and the sentences between the starting point and the end point are extracted to obtain the corresponding segment of the trigger sentence in the target audio.
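The first two extraction rules can be sketched together as follows; each transcript sentence is assumed to be a dictionary with "text", "start_ms" and "end_ms" fields, which is an illustrative representation rather than one required by the embodiment.

```python
# Sketch of the first two extraction rules: take up to a first set number of sentences
# after the trigger sentence, ending early at the first pause longer than the set duration.

def extract_segment(sentences, trigger_index, first_set_number=30, max_pause_ms=30_000):
    segment = []
    last = min(trigger_index + first_set_number, len(sentences))
    for i in range(trigger_index, last):
        segment.append(sentences[i])
        # End the segment at the first pause longer than the set duration (30 s here).
        if i + 1 < len(sentences):
            pause = sentences[i + 1]["start_ms"] - sentences[i]["end_ms"]
            if pause > max_pause_ms:
                break
    return segment
```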
In one example, the start and end points of the target audio text are labeled as shown in the following table:
[Table 1: start-point and end-point labels of the target audio text]
In one embodiment, the trigger sentences include a plurality of sentences, and the number of sentences between adjacent trigger sentences is smaller than a set value.
In one embodiment, extracting a corresponding segment of the trigger sentence in the target audio includes:
taking, from among a group of no more than the second set number of trigger sentences, the trigger sentence earliest in time order as the starting point, extracting the set number of sentences in time order after the trigger sentence latest in time order, and combining them with the sentences between those trigger sentences to form the corresponding segment of the trigger sentences in the target audio.
For example, in one example, if another trigger sentence b exists among the 30 sentences extracted after trigger sentence a, another 30 sentences are extracted after b, and the position of the last of those 30 sentences is taken as position c; the span from starting point a to position c forms an initial, not yet finalized paragraph. If a pause longer than 30 s exists between starting point a and position c, that pause is taken as the end point; if there is no such pause, position c is taken as the end point. The number of trigger sentences between the starting point and the end point does not exceed a set value, for example 3. If, in the above example, a further trigger sentence d exists between b and c, 30 sentences are extracted again after d; the extraction has then been performed three times, and even if another trigger sentence e exists among the 30 sentences after d, no further extraction is performed from e. The time interval between the starting point and the end point does not exceed 5 minutes.
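The merging behaviour described in this example can be sketched as follows, reusing the same sentence representation as above; the limits of 30 sentences, 3 trigger sentences and 5 minutes follow the example and are otherwise assumptions.

```python
# Sketch of the third extraction rule: when several trigger sentences occur close
# together, the segment runs from the earliest trigger sentence to a first set number
# of sentences after the latest one, capped at a maximum time span.

def extract_merged_segment(sentences, trigger_indices, first_set_number=30,
                           max_triggers=3, max_duration_ms=5 * 60 * 1000):
    triggers = sorted(trigger_indices)[:max_triggers]   # at most the second set number of triggers
    start = triggers[0]                                  # earliest trigger sentence is the start point
    end = min(triggers[-1] + first_set_number, len(sentences) - 1)
    segment = sentences[start:end + 1]
    # Enforce the cap on the time span between the start point and the end point.
    while segment and segment[-1]["end_ms"] - segment[0]["start_ms"] > max_duration_ms:
        segment = segment[:-1]
    return segment
```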
For each extracted segment, text features are extracted for each sentence in the segment; the extracted text features include the bag of words, the term frequency-inverse document frequency, and the word vector. The process of extracting the segment in which the trigger sentence is located in the target audio may be performed by an LSTM (Long Short-Term Memory neural network) model.
In one embodiment, determining the target audio content quality from segments in the result set comprises:
extracting statistical characteristics of the fragments;
calculating the fraction of the segment according to the weight given to the statistical characteristics of the segment in advance;
and determining the target audio content quality according to the scores of all the segments in the result set.
In one example, the statistical characteristics of the segments may include any combination of the following (a sketch of computing several of these features is given after the list):
query_num: the number of questions asked;
verb_num: the number of verbs;
text_num_sum: the total number of words;
text_num_std: the standard deviation of the number of words per sentence;
text_num_mean: the average number of words per sentence;
x_num: the number of occurrences of filler interjections (e.g. "um", "oh", "ah");
spread_duration: the speaking duration;
sensor_duration: the duration of the paragraph;
spread_duration_rate: the proportion of speaking time (speaking duration divided by paragraph duration);
slient_duration_rate: the proportion of silence (1 − spread_duration_rate).
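The sketch below computes several of the listed statistics for one segment; the sentence representation, the question heuristic and the filler-word list are simplifying assumptions, and part-of-speech-based features such as verb_num are omitted.

```python
import numpy as np

# Sketch of computing some of the statistical characteristics listed above for a segment
# given as a list of sentence dicts with "text", "start_ms" and "end_ms" fields.

def segment_features(segment, fillers=("um", "oh", "ah")):
    word_counts = [len(s["text"].split()) for s in segment]
    paragraph_ms = segment[-1]["end_ms"] - segment[0]["start_ms"] if segment else 0
    speak_ms = sum(s["end_ms"] - s["start_ms"] for s in segment)
    return {
        "query_num": sum(s["text"].count("?") for s in segment),
        "text_num_sum": int(np.sum(word_counts)) if word_counts else 0,
        "text_num_std": float(np.std(word_counts)) if word_counts else 0.0,
        "text_num_mean": float(np.mean(word_counts)) if word_counts else 0.0,
        "x_num": sum(s["text"].lower().count(f) for s in segment for f in fillers),
        "spread_duration": speak_ms,
        "sensor_duration": paragraph_ms,
        "spread_duration_rate": speak_ms / paragraph_ms if paragraph_ms else 0.0,
        "slient_duration_rate": 1 - speak_ms / paragraph_ms if paragraph_ms else 0.0,
    }
```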
By setting the statistical characteristics, assigning weights to them in advance, and calculating the score of the segment based on the pre-assigned weights, the segment can be scored and graded objectively. In a specific example, each statistical feature may be normalized and given a random weight, the features are multiplied by their weights and summed to obtain a paragraph score, and the paragraphs are then sorted by score to obtain a new ordering. In another embodiment, the NDCG (Normalized Discounted Cumulative Gain) metric may be used to measure the effect and thereby obtain the best weight values. The NDCG is calculated as follows:
NDCG = \frac{DCG}{IDCG}

where

DCG = \sum_{k=1}^{P} \frac{2^{rel_k} - 1}{\log_2(k + 1)}

DCG is the Discounted Cumulative Gain, rel_k is the score of the k-th segment, P is the total number of segments, and k denotes the k-th segment. IDCG is the ideal DCG, i.e. the DCG calculated according to the true ordering of the segments.
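A short sketch of the NDCG computation with the formulas above is given below; it assumes rel_k is the score of the k-th segment in the predicted ordering and that the ideal ordering is obtained by sorting the scores.

```python
import math

# Sketch of the NDCG metric defined above: DCG over the predicted ordering divided by
# the DCG of the ideal (true) ordering.

def dcg(scores):
    # k is 0-based here, so the discount log2(k + 1) of the formula becomes log2(k + 2).
    return sum((2 ** rel - 1) / math.log2(k + 2) for k, rel in enumerate(scores))

def ndcg(predicted_order_scores):
    ideal = sorted(predicted_order_scores, reverse=True)
    idcg = dcg(ideal)
    return dcg(predicted_order_scores) / idcg if idcg else 0.0
```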
In the embodiment of the present invention, the statistical features may be calculated to obtain feature vectors, and a Logistic Regression model may be trained using the feature vectors to obtain the optimal weights among the features. The optimal weights can also be obtained with classification models such as an SVM (Support Vector Machine) or a GBDT.
In a specific implementation, the segments obtained according to the trigger sentences in the target audio may be scored according to the set weights, so as to obtain the highest score max_score and the lowest score min_score among all the segments.
For each segment in the result set of the target audio, a score t_score is obtained from the product of the weights and the statistical features, and p_score is then obtained as follows:
if t_score < min_score, then p_score = 0;
if min_score ≤ t_score ≤ max_score, then p_score = (t_score − min_score) / (max_score − min_score);
if t_score > max_score, then p_score = 1.
It can be seen that the calculated score is a value between 0 and 1. The scores are mapped to the corresponding levels as shown in Table 2 below:

Score                  Level
0 ≤ p_score < 0.2      1
0.2 ≤ p_score < 0.5    2
0.5 ≤ p_score < 0.8    3
0.8 ≤ p_score ≤ 1      4

TABLE 2
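The normalisation of t_score into p_score and the level mapping of Table 2 can be written directly as follows; min_score and max_score are assumed to have been obtained from the weighted scores as described above.

```python
# Sketch of the score normalisation and the level mapping of Table 2.

def normalize_score(t_score, min_score, max_score):
    if t_score < min_score:
        return 0.0
    if t_score > max_score:
        return 1.0
    return (t_score - min_score) / (max_score - min_score)

def score_to_level(p_score):
    if p_score < 0.2:
        return 1
    if p_score < 0.5:
        return 2
    if p_score < 0.8:
        return 3
    return 4
```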
In one example, each segment related to a trigger sentence is scored and the result is output as one record per segment, where start_ts_ms denotes the segment start time, end_ts_ms denotes the end time, score denotes the score, and index denotes the serial number of the segment associated with the trigger sentence.
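The original presents the concrete output format as an image; the JSON-style rendering below is an assumption built only from the fields described above, with made-up timestamp and score values.

```python
import json

# Illustrative output record per segment, using only the fields described in the text;
# the JSON layout and the numeric values are assumptions.

example_output = [
    {"start_ts_ms": 120000, "end_ts_ms": 185000, "score": 0.8, "index": 1},
    {"start_ts_ms": 904000, "end_ts_ms": 967000, "score": 0.5, "index": 2},
]
print(json.dumps(example_output, indent=2))
```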
The present invention also provides an audio content quality detection apparatus, the structure of which is shown in fig. 5, including:
a trigger sentence detection module 61: configured to acquire, according to a target sentence set, a trigger sentence in target audio that matches the target sentence set;
a segment extraction module 62: configured to extract the segment of the target audio in which the trigger sentence is located;
a result set generation module 63: configured to add the segment in which the trigger sentence is located to a result set of the target audio;
a quality determination module 64: configured to determine the quality of the target audio content according to the segments in the result set.
In one embodiment, as shown in fig. 6, the apparatus further comprises:
a second sentence searching module 71: configured to search a corpus for a second sentence whose similarity to a first sentence in a first sentence set reaches a first set threshold;
a target sentence adding module 72: configured to, if the second sentence does not exist in the first sentence set, take the second sentence as a new first sentence, add the new first sentence to the first sentence set, and trigger the second sentence searching module again;
a target sentence set construction module 73: configured to, if the first sentence set already comprises the second sentence, construct a target sentence set according to the first sentence set, wherein the sentences in the target sentence set are target sentences.
In one embodiment, the second sentence lookup module comprises:
a first vector unit: configured to perform word segmentation processing on the first sentence to obtain a first vector representation of the first sentence;
a second vector unit: configured to perform word segmentation processing on each sentence in the corpus to obtain a second vector representation of each sentence;
a similarity calculation unit: configured to calculate the similarity between the first sentence and each sentence in the corpus according to the first vector representation and the second vector representation;
a similarity processing unit: configured to search for the second sentence according to the similarity and the first set threshold.
In one embodiment, the similarity calculation unit is further configured to:
calculating the similarity of the first vector representation and the second vector representation by using the following formula to obtain the similarity of the target sentence and each sentence in the corpus:
S = (Σ_{i=1}^{n} A_i · B_i) / ( √(Σ_{i=1}^{n} A_i²) · √(Σ_{i=1}^{n} B_i²) )
wherein S is the similarity between A and B, A is the first vector representation, B is the second vector representation, n is the number of word segments, i denotes the i-th word, A_i is the i-th word vector of the target sentence, and B_i is the i-th word vector of a sentence in the corpus.
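A minimal sketch of this similarity computation, assuming the cosine form given above (the function name is illustrative):

```python
import math

def cosine_similarity(a, b):
    # Similarity S between the first vector representation A and the second vector representation B.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```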
In one embodiment, the apparatus further comprises:
a pattern extraction module: for acquiring, according to the target sentence set, a pattern description representing the common characteristics of the target sentences in the target sentence set;
a regular expression obtaining module: for obtaining a regular expression of the target sentence set according to the pattern description;
the trigger sentence detection module is further configured to:
acquire a trigger sentence in the target audio that matches the regular expression.
In one embodiment, the regular expression acquisition module comprises:
a candidate regular expression acquisition unit: for obtaining a candidate regular expression using the pattern description;
a candidate set detection unit: for selecting a candidate set from a detection sample according to the candidate regular expression, wherein the detection sample includes a plurality of sample sentences;
a screening unit: for screening the candidate regular expressions according to the similarity between the sample sentences in the candidate set and the target sentences, so as to obtain the regular expression.
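A hedged sketch of how such a screening unit might keep only candidate regular expressions whose matched sample sentences resemble the target sentences; the averaging rule and threshold are assumptions, and the similarity function is passed in rather than defined here:

```python
import re

def screen_candidates(candidate_patterns, sample_sentences, target_sentences,
                      similarity, threshold=0.8):
    kept = []
    for pattern in candidate_patterns:
        # Candidate set: sample sentences matched by this candidate regular expression.
        candidate_set = [s for s in sample_sentences if re.search(pattern, s)]
        if not candidate_set:
            continue
        # Keep the pattern only if its matches are, on average, similar enough
        # to at least one target sentence.
        avg_sim = sum(max(similarity(s, t) for t in target_sentences)
                      for s in candidate_set) / len(candidate_set)
        if avg_sim >= threshold:
            kept.append(pattern)
    return kept
```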
In one embodiment, the segment extraction module comprises:
a first extraction unit: for taking the trigger sentence as a starting point and sequentially extracting a first set number of sentences in time order as the corresponding segment of the trigger sentence in the target audio;
or, a second extraction unit: for taking the trigger sentence as a starting point, taking the first subsequent blank longer than a set duration as an end point, and extracting the sentences between the starting point and the end point as the corresponding segment of the trigger sentence in the target audio;
or, a third extraction unit: for, when there is more than one trigger sentence and the number of trigger sentences is lower than a second set number, taking the trigger sentence earliest in time order as a starting point, extracting a first set number of sentences in time order after the trigger sentence latest in time order, and forming, together with the sentences between the trigger sentences, the corresponding segment of the trigger sentences in the target audio, wherein the number of sentences between two adjacent trigger sentences is smaller than the first set number.
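As one concrete reading of the second extraction unit above, the sketch below stops the segment at the first silence longer than a set duration; the sentence representation (text with start/end timestamps in milliseconds) and the gap threshold are assumptions:

```python
def extract_until_gap(sentences, trigger_idx, max_gap_ms=5000):
    # Each sentence is a (text, start_ms, end_ms) tuple. Starting from the trigger
    # sentence, keep adding sentences until the blank between two consecutive
    # sentences exceeds the set duration.
    segment = [sentences[trigger_idx]]
    for prev, cur in zip(sentences[trigger_idx:], sentences[trigger_idx + 1:]):
        if cur[1] - prev[2] > max_gap_ms:  # silence between prev end and cur start
            break
        segment.append(cur)
    return segment
```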
In one embodiment, the trigger sentence includes a plurality of trigger sentences, and the number of sentences between adjacent trigger sentences is smaller than a set value.
In one embodiment, the quality determination module comprises:
a feature extraction unit: for extracting statistical features of the segments;
a score calculation unit: for calculating the score of a segment according to the weights assigned in advance to the statistical features of the segment;
a quality unit: for determining the target audio content quality according to the scores of the segments in the result set.
The functions of each module in each apparatus in the embodiments of the present invention may refer to the corresponding description in the above method, and are not described herein again.
Fig. 7 shows a block diagram of the structure of an audio content quality detection apparatus according to an embodiment of the present invention. As shown in fig. 7, the apparatus includes: a memory 910 and a processor 920, the memory 910 having stored therein computer programs operable on the processor 920. The processor 920 implements the audio content quality detection method in the above embodiments when executing the computer program. The number of the memory 910 and the processor 920 may be one or more.
The device also includes:
and a communication interface 930 for communicating with an external device to perform data interactive transmission.
Memory 910 may include high-speed RAM, and may also include non-volatile memory, such as at least one disk memory.
If the memory 910, the processor 920 and the communication interface 930 are implemented independently, the memory 910, the processor 920 and the communication interface 930 may be connected to each other through a bus and perform communication with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 7, but this is not intended to represent only one bus or type of bus.
Optionally, in an implementation, if the memory 910, the processor 920 and the communication interface 930 are integrated on a chip, the memory 910, the processor 920 and the communication interface 930 may complete communication with each other through an internal interface.
An embodiment of the present invention provides a computer-readable storage medium, which stores a computer program, and the computer program is used for implementing the method of any one of the above embodiments when being executed by a processor.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a separate product, may also be stored in a computer readable storage medium. The storage medium may be a read-only memory, a magnetic or optical disk, or the like.
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive various changes or substitutions within the technical scope of the present invention, and these should be covered by the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (16)

1. A method for audio content quality detection, comprising:
acquiring a trigger sentence matched with the target sentence set in target audio according to the target sentence set;
extracting a segment where the trigger sentence is located in the target audio;
adding the segment where the trigger sentence is located into a result set of the target audio;
and determining the target audio content quality according to the segments in the result set.
2. The method of claim 1, wherein prior to the step of obtaining a trigger sentence in the target audio that matches the target set of sentences, the method further comprises:
searching a second sentence, the similarity of which with a first sentence in the first sentence set reaches a first set threshold value, in the corpus;
if the second sentence does not exist in the first sentence set, taking the second sentence as a new first sentence, adding the new first sentence into the first sentence set, and returning to the step of searching the corpus for the second sentence with the similarity reaching a first set threshold value with the first sentence in the first sentence set; and if the first sentence set comprises the second sentence, constructing a target sentence set according to the first sentence set, wherein sentences in the target sentence set are target sentences.
3. The method according to claim 2, wherein searching the corpus for a second sentence having a similarity to the first sentence in the first sentence set reaching a first set threshold comprises:
performing word segmentation processing on the first sentence to obtain a first vector representation of the first sentence; performing word segmentation processing on each sentence in the corpus to obtain a second vector representation of each sentence;
calculating the similarity of the first sentence and each sentence in the corpus according to the first vector representation and the second vector representation;
and searching the second sentence according to the similarity and the first set threshold.
4. The method of claim 2, wherein after constructing the set of target sentences, further comprising:
acquiring, according to the target sentence set, a pattern description representing the common characteristics of the target sentences in the target sentence set;
obtaining a regular expression of the target sentence set according to the pattern description;
the obtaining of the trigger sentence matched with the target sentence set in the target audio includes:
and acquiring a trigger sentence matched with the regular expression in the target audio.
5. The method of claim 4, wherein obtaining the regular expression of the target sentence set according to the pattern description comprises:
obtaining a candidate regular expression by using the pattern description;
selecting a candidate set from a detection sample according to the candidate regular expression, wherein the detection sample comprises a plurality of sample sentences;
and screening the candidate regular expressions according to the similarity between the sample sentences in the candidate set and the target sentences to obtain the regular expressions.
6. The method of claim 1, wherein extracting the segment of the target audio where the trigger sentence is located comprises:
taking the trigger sentence as a starting point, and sequentially extracting sentences of a first set number according to a time sequence to be used as corresponding segments of the trigger sentence in the target audio;
or, taking the trigger sentence as a starting point, taking the first subsequent blank longer than a set duration as an end point, and extracting the sentences between the starting point and the end point as the corresponding segment of the trigger sentence in the target audio;
or, when there is more than one trigger sentence and the number of trigger sentences is lower than a second set number, taking the trigger sentence earliest in time order as a starting point, extracting a first set number of sentences in time order after the trigger sentence latest in time order, and forming, together with the sentences between the trigger sentences, the corresponding segment of the trigger sentences in the target audio, wherein the number of sentences between two adjacent trigger sentences is less than the first set number.
7. The method of claim 1, wherein determining the target audio content quality from segments in the result set comprises:
extracting statistical characteristics of the fragments;
calculating the fraction of the segment according to the weight given to the statistical characteristics of the segment in advance;
and determining the target audio content quality according to the scores of all the segments in the result set.
8. An audio content quality detection apparatus, comprising:
a trigger sentence detection module: for acquiring, according to a target sentence set, a trigger sentence in the target audio that matches the target sentence set;
a fragment extraction module: for extracting the segment where the trigger sentence is located in the target audio;
a result set generation module: for adding the segment where the trigger sentence is located into the result set of the target audio;
a quality determination module: for determining the target audio content quality according to the segments in the result set.
9. The apparatus of claim 8, further comprising:
a second sentence lookup module: for searching the corpus for a second sentence whose similarity to a first sentence in the first sentence set reaches a first set threshold;
a target sentence addition module: for, if the second sentence does not exist in the first sentence set, taking the second sentence as a new first sentence, adding the new first sentence into the first sentence set, and triggering the second sentence lookup module again;
a target sentence set construction module: for, if the first sentence set includes the second sentence, constructing a target sentence set from the first sentence set, wherein the sentences in the target sentence set are target sentences.
10. The apparatus of claim 9, wherein the second sentence lookup module comprises:
a first vector unit: for performing word segmentation on the first sentence to obtain a first vector representation of the first sentence;
a second vector unit: for performing word segmentation on each sentence in the corpus to obtain a second vector representation of each sentence;
a similarity calculation unit: for calculating the similarity between the first sentence and each sentence in the corpus according to the first vector representation and the second vector representation;
a similarity processing unit: for searching for the second sentence according to the similarity and the first set threshold.
11. The apparatus of claim 9, further comprising:
a pattern extraction module: for acquiring, according to the target sentence set, a pattern description representing the common characteristics of the target sentences in the target sentence set;
a regular expression obtaining module: for obtaining a regular expression of the target sentence set according to the pattern description;
the trigger sentence detection module is further configured to:
acquire a trigger sentence in the target audio that matches the regular expression.
12. The apparatus of claim 11, wherein the regular expression acquisition module comprises:
a candidate regular expression acquisition unit: for obtaining a candidate regular expression using the pattern description;
a candidate set detection unit: for selecting a candidate set from a detection sample according to the candidate regular expression, wherein the detection sample includes a plurality of sample sentences;
a screening unit: for screening the candidate regular expressions according to the similarity between the sample sentences in the candidate set and the target sentences, so as to obtain the regular expression.
13. The apparatus of claim 8, wherein the segment extraction module comprises:
a first extraction unit: for taking the trigger sentence as a starting point and sequentially extracting a first set number of sentences in time order as the corresponding segment of the trigger sentence in the target audio;
or, a second extraction unit: for taking the trigger sentence as a starting point, taking the first subsequent blank longer than a set duration as an end point, and extracting the sentences between the starting point and the end point as the corresponding segment of the trigger sentence in the target audio;
or, a third extraction unit: for, when there is more than one trigger sentence and the number of trigger sentences is lower than a second set number, taking the trigger sentence earliest in time order as a starting point, extracting a first set number of sentences in time order after the trigger sentence latest in time order, and forming, together with the sentences between the trigger sentences, the corresponding segment of the trigger sentences in the target audio, wherein the number of sentences between two adjacent trigger sentences is smaller than the first set number.
14. The apparatus of claim 8, wherein the quality determination module comprises:
a feature extraction unit: for extracting statistical features of the segments;
a score calculation unit: for calculating the score of a segment according to the weights assigned in advance to the statistical features of the segment;
a quality unit: for determining the target audio content quality according to the scores of the segments in the result set.
15. An audio content quality detection apparatus, comprising:
one or more processors;
a memory for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-7.
16. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
CN201910922694.7A 2019-09-26 2019-09-26 Method and device for detecting quality of audio content Active CN112559798B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910922694.7A CN112559798B (en) 2019-09-26 2019-09-26 Method and device for detecting quality of audio content
PCT/CN2020/107066 WO2021057270A1 (en) 2019-09-26 2020-08-05 Audio content quality inspection method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910922694.7A CN112559798B (en) 2019-09-26 2019-09-26 Method and device for detecting quality of audio content

Publications (2)

Publication Number Publication Date
CN112559798A true CN112559798A (en) 2021-03-26
CN112559798B CN112559798B (en) 2022-05-17

Family

ID=75030159

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910922694.7A Active CN112559798B (en) 2019-09-26 2019-09-26 Method and device for detecting quality of audio content

Country Status (2)

Country Link
CN (1) CN112559798B (en)
WO (1) WO2021057270A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114374924B (en) * 2022-01-07 2024-01-19 上海纽泰仑教育科技有限公司 Recording quality detection method and related device


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9898665B2 (en) * 2015-10-29 2018-02-20 International Business Machines Corporation Computerized video file analysis tool and method
CN109783821B (en) * 2019-01-18 2023-06-27 广东小天才科技有限公司 Method and system for searching video of specific content

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104318921A (en) * 2014-11-06 2015-01-28 科大讯飞股份有限公司 Voice section segmentation detection method and system and spoken language detecting and evaluating method and system
CA2973740A1 (en) * 2015-01-30 2016-08-04 Inscape Data, Inc. Methods for identifying video segments and displaying option to view from an alternative source and/or on an alternative device
CN105184315A (en) * 2015-08-26 2015-12-23 北京中电普华信息技术有限公司 Quality inspection treatment method and system
WO2017076304A1 (en) * 2015-11-03 2017-05-11 广州酷狗计算机科技有限公司 Audio data processing method and device
WO2017084360A1 (en) * 2015-11-17 2017-05-26 乐视控股(北京)有限公司 Method and system for speech recognition
CN107093431A (en) * 2016-02-18 2017-08-25 中国移动通信集团辽宁有限公司 A kind of method and device that quality inspection is carried out to service quality
CN106021223A (en) * 2016-05-09 2016-10-12 Tcl集团股份有限公司 Sentence similarity calculation method and system
CN108280179A (en) * 2018-01-22 2018-07-13 百度在线网络技术(北京)有限公司 Method and system, terminal and the computer readable storage medium of audio advertisement detection
US10244104B1 (en) * 2018-06-14 2019-03-26 Microsoft Technology Licensing, Llc Sound-based call-quality detector
CN108962284A (en) * 2018-07-04 2018-12-07 科大讯飞股份有限公司 A kind of voice recording method and device
CN109147765A (en) * 2018-11-16 2019-01-04 安徽听见科技有限公司 Audio quality comprehensive evaluating method and system
CN109614934A (en) * 2018-12-12 2019-04-12 易视腾科技股份有限公司 Online teaching quality assessment parameter generation method and device
CN109767786A (en) * 2019-01-29 2019-05-17 广州势必可赢网络科技有限公司 A kind of online voice real-time detection method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ÉVA SZÉKELY et al.: "Detecting a targeted voice style in an audiobook using voice quality features", IEEE *
YU Chunyan et al.: "Audio Emotion Perception and Video Highlight Extraction", Journal of Computer-Aided Design & Computer Graphics *

Also Published As

Publication number Publication date
CN112559798B (en) 2022-05-17
WO2021057270A1 (en) 2021-04-01


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant