WO2023207566A1 - Voice room quality assessment method and apparatus, device, medium, and product


Info

Publication number
WO2023207566A1
Authority
WO
WIPO (PCT)
Prior art keywords
noun
voice
nouns
basic
speech
Application number
PCT/CN2023/087339
Other languages
English (en)
French (fr)
Inventor
李益永
温偲
陈建强
陈德健
项伟
Original Assignee
广州市百果园信息技术有限公司
李益永
Application filed by 广州市百果园信息技术有限公司 and 李益永
Publication of WO2023207566A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G06F40/216 - Parsing using statistical methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Definitions

  • This application relates to the field of instant messaging technology, and in particular to a voice room quality assessment method and a corresponding apparatus, device, medium, and product.
  • Users of a live broadcast platform can communicate instantly by voice, giving rise to live broadcast rooms with the nature of instant calls, specifically dedicated voice rooms. Users in a voice room can pursue purposes such as topic discussion, talent display, information sharing, and knowledge education, which can promote overall social benefits.
  • Live broadcast platforms usually run a large number of voice rooms concurrently, and different voice rooms differ greatly in quality because the content contributed by their speaking users varies widely.
  • the platform can use voice room quality evaluation technology to assist in screening high-quality voice rooms.
  • This application provides a voice room quality assessment method and its corresponding device, voice room recognition equipment, computer-readable storage media and computer program products.
  • a voice room quality assessment method including the following steps:
  • constructing a coding vector of the spoken text, the coding vector including statistical features of the number of sound source objects of the voice stream, statistical features of the total number of utterances, and statistical features of the number of valid nouns in the spoken text;
  • determining the quality category of the voice room according to the coding vector.
  • a voice room quality assessment device including:
  • a speech recognition module configured to obtain the voice stream in the voice room within a unit time period and recognize spoken text from the voice stream;
  • a text encoding module configured to construct a coding vector of the spoken text, the coding vector including statistical features of the number of sound source objects of the speech stream, statistical features of the total number of utterances, and statistical features of the number of valid nouns in the spoken text;
  • a quality identification module configured to determine the quality category of the speech room according to the encoding vector.
  • a voice room recognition device including a central processor and a memory.
  • the central processor is configured to call and run a computer program stored in the memory to execute the steps of the voice room quality assessment method described in the present application.
  • a computer-readable storage medium, which stores, in the form of computer-readable instructions, a computer program implementing the voice room quality assessment method; when the computer program is called and run by a computer, the steps included in the method are executed.
  • a computer program product which includes a computer program/instruction, and when the computer program/instruction is executed by a processor, the steps of the method described in any embodiment of the present application are implemented.
  • Figure 1 is a schematic diagram of the network architecture corresponding to the voice room operating environment applied in this application;
  • Figure 2 is a schematic flowchart of an embodiment of the voice room quality assessment method of the present application;
  • Figure 3 is a schematic flowchart of the process of recognizing spoken text from the voice stream in an embodiment of the present application;
  • Figure 4 is a schematic flowchart of the process of constructing a coding vector in an embodiment of the present application;
  • Figure 5 is a schematic flowchart of the process of obtaining statistical features based on word segmentation of the spoken text in an embodiment of the present application;
  • Figure 6 is a schematic flowchart of the process of segmenting the spoken text to obtain a word segmentation set in an embodiment of the present application;
  • Figure 7 is a schematic flowchart of the process of determining each corresponding statistical feature according to the valid noun set of the spoken text in an embodiment of the present application;
  • Figure 8 is a schematic flowchart of the process of performing fuzzy matching on the redundant subset of the valid noun set and counting the number of noun hits in an embodiment of the present application;
  • Figure 9 is a schematic flowchart of the training process of the neural network classification model used to determine the quality category mapped by the coding vector in an embodiment of the present application;
  • Figure 10 is a schematic flowchart of the process of pushing a voice room recommendation list in response to a voice room recommendation request in an embodiment of the present application;
  • Figure 11 is an exemplary graphical user interface of this application, used to display the voice room recommendation list;
  • Figure 12 is a functional block diagram of the voice room quality assessment device of this application;
  • Figure 13 is a schematic structural diagram of a voice room recognition device used in this application.
  • The network architecture shown in Figure 1 can be configured to deploy computer program products obtained by implementing various embodiments of the present application, so as to provide voice room services.
  • An online voice room is constructed so that its users can carry out online interactions.
  • the live broadcast room in traditional online live broadcast can also be regarded as a specific form of the voice room described in this application due to the presence of voice streams.
  • The application server 81 shown in Figure 1 can be used to support the implementation of the voice room, and the media server 82 can be used to forward the voice streams of each voice room; terminal devices such as the computer 83 and the mobile phone 84 serve as clients provided to users of the voice room, and present a graphical user interface to the corresponding user through a front-end page or an application program matching the voice room service, so as to achieve human-computer interaction.
  • a voice room quality assessment method is provided according to one aspect of the present application. In one embodiment, it includes the following steps:
  • Step S1100 Obtain the voice stream in the voice room within the unit time period, and identify the spoken text from the voice stream;
  • Step S1200 Construct a coding vector of the spoken text, the coding vector including statistical features of the number of sound source objects of the voice stream, statistical features of the total number of utterances, and statistical features of the number of valid nouns in the spoken text;
  • Step S1300 Determine the quality category of the voice room according to the coding vector.
  • The voice room service of the live broadcast platform concurrently runs a large number of voice rooms; the voice data generated by each voice room is uploaded to the media server in a streaming media format, and the media server pushes the corresponding voice stream to the terminal devices of the receiving users in the corresponding voice room, thereby supporting instant communication in the voice room.
  • the voice stream corresponding to the voice room can be obtained from the media server.
  • A unit time period is preset, for example 20 minutes or 30 minutes. Those skilled in the art can flexibly set an appropriate length, as long as an appropriate amount of voice content can be obtained within the unit time period.
  • The unit time period can be traced back from the current time, and the corresponding voice stream can be processed using the unit time period as the traceback duration; in other words, every unit time period, the voice stream generated during that period is processed. This enables staged recognition of the voice stream continuously generated in the voice room.
  • the speech text generally includes speech sentences corresponding to each sound source object in the speech stream.
  • step S1200 in order to achieve a preliminary representation of the comprehensive quality of the spoken text, multiple statistical features of the speech stream corresponding to the unit time period may be used to construct a corresponding encoding vector.
  • the statistical features include statistical features characterized by the number of sound source objects in the voice stream, statistical features characterized by the total number of utterances in the voice stream, and statistical features characterized by effective nouns in the spoken text.
  • The number of sound source objects refers to the total number of users who effectively speak in the voice room within the unit time period. It can be obtained by the voice room service, for example by monitoring each user's speaking behavior within the unit time period and confirming and counting the corresponding audio data submitted.
  • Any feasible sound source separation technology can be used to perform sound source separation on the voice stream; those skilled in the art can implement this flexibly according to the principles disclosed herein. It can be understood that the greater the number of sound source objects, the larger the number of speaking users in the voice room.
  • The total number of utterances refers to the total number of valid utterances in the voice room within the unit time period. Similarly, it can be obtained by the voice room service, for example by monitoring the audio data submitted for each speaking behavior within the unit time period and confirming and counting it. Any feasible voice detection technology can be used to identify vocal segments from multiple sources; those skilled in the art can implement this flexibly according to the principles disclosed herein. It is understandable that the greater the total number of utterances, the more active the communication in the voice room.
  • The statistical feature of the number of valid nouns in the spoken text refers to data obtained by counting the nouns in the spoken text that match nouns confirmed in advance to be valid. A basic noun table composed of manually annotated basic nouns can be provided in advance, and one or more counts obtained by matching each noun in the spoken text against the basic noun table in one or more ways are used as the corresponding statistical features. It can be understood that the greater the number of valid nouns in a spoken text, the richer its information value.
  • In this way, the statistical features of the number of sound source objects, the total number of utterances, and the number of valid nouns in the spoken text quantify the number of speaking users in the voice room, the activity of speaking, and the information value contained in the speech. Constructed into a coding vector corresponding to the spoken text, they constitute a preliminary representation of the quality information of the spoken text.
  • a quality classification space is constructed in advance, which contains multiple quality categories.
  • The number of categories can be set as needed, for example three categories representing "high, medium, low", or four categories representing "excellent, high-quality, ordinary, vulgar", and so on, which can be set by those skilled in the art.
  • various methods can be used to determine the quality category to which the encoding vector is mapped based on the encoding vector.
  • A quantitative mapping relationship from each statistical feature in the coding vector to each quality category can be constructed based on a mathematical model. For example, the statistical features are weighted and normalized to obtain a sum value, the sum value is matched against the preset threshold interval of each quality category, and the quality category whose threshold interval contains the sum value is determined as the quality category mapped by the coding vector, that is, the quality category corresponding to the voice stream in the unit time period. This method is simple to implement and requires little computation, which helps save system overhead and improve response speed.
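  • The weighted-sum mapping described above can be sketched as follows. The weights, normalization bounds, and threshold intervals here are illustrative assumptions, not values given in the application:

```python
# Minimal sketch of the weighted-sum mapping: normalize each statistical
# feature, take a weighted sum, and match the sum value against preset
# threshold intervals. All numeric values below are illustrative.

def quality_category(features, weights, bounds, intervals):
    """Map an encoding vector to a quality category.

    features:  raw statistical features, e.g. (sources, utterances, nouns)
    weights:   per-feature weights summing to 1.0
    bounds:    per-feature (min, max) used for normalization
    intervals: {category: (low, high)} threshold intervals over the sum value
    """
    # Normalize each statistical feature to [0, 1], then take the weighted sum.
    s = 0.0
    for x, w, (lo, hi) in zip(features, weights, bounds):
        norm = (min(max(x, lo), hi) - lo) / (hi - lo)
        s += w * norm
    # Match the sum value against each category's preset threshold interval.
    for category, (low, high) in intervals.items():
        if low <= s < high:
            return category
    return "low"

intervals = {"high": (0.66, 1.01), "medium": (0.33, 0.66), "low": (0.0, 0.33)}
cat = quality_category(
    features=(12, 300, 80),               # sources, utterances, valid nouns
    weights=(0.3, 0.3, 0.4),
    bounds=((0, 20), (0, 500), (0, 100)),
    intervals=intervals,
)
```
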
  • Alternatively, based on traditional machine learning principles, a decision tree algorithm such as ID3, CART, GBDT, or XGB can be used to establish a mathematical model and solve it against the coding vector to obtain the quality category to which it maps. Specific examples are as follows:
  • for example, whether the i-th room belongs to the high-quality category at the j-th minute within a 20-minute unit time period.
  • the example ID3 algorithm is a decision tree algorithm.
  • the core principle of the ID3 algorithm is to select features for partitioning based on information gain, and then recursively build a decision tree.
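  • The information-gain criterion at the core of ID3 can be sketched as follows; the toy feature rows and quality labels are invented for illustration and do not come from the application:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature_index):
    """Entropy reduction obtained by splitting the rows on one feature;
    ID3 selects the feature with the largest gain at each node."""
    base = entropy(labels)
    splits = {}
    for row, label in zip(rows, labels):
        splits.setdefault(row[feature_index], []).append(label)
    remainder = sum(len(part) / len(labels) * entropy(part) for part in splits.values())
    return base - remainder

# Toy data: (many_sources, many_nouns) -> quality label.
rows = [(1, 1), (1, 0), (0, 1), (0, 0)]
labels = ["high", "medium", "medium", "low"]
g0 = information_gain(rows, labels, 0)  # gain from the "many sources" feature
```
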
  • the full English name of the example CART algorithm is Classification And Regression Trees, which is classification and regression trees. As the name suggests, the CART algorithm can be used for both classification and regression.
  • the example GBDT algorithm, whose full English name is Gradient Boosting Decision Tree, is an ensemble algorithm based on decision trees.
  • Gradient Boosting is an algorithm in the ensemble method boosting, which iterates new learners through gradient descent.
  • the example XGB algorithm, also known as XGBoost, is an ensemble learning method with CART as the base classifier, and is widely used in data modeling competitions due to its excellent computing efficiency and prediction accuracy.
  • Alternatively, deep semantic information can be extracted from the coding vector and mapped to the quality classification space with the help of a classifier; from the classification probabilities obtained for each quality category in the quality classification space, the quality category corresponding to the maximum classification probability is taken, thereby determining the quality category corresponding to the voice stream in the unit time period.
  • the neural network model should be pre-trained to a convergence state by those skilled in the art using a sufficient amount of training samples.
  • The basic model can be implemented using a CNN (Convolutional Neural Network), an RNN (Recurrent Neural Network), etc., and the classifier can be constructed using the Softmax() function; those skilled in the art can make selections flexibly based on the principles disclosed here. This method takes into account the semantic correlations among the statistical features and is suitable for providing large-scale services.
  • The coding vector in this application is constructed from numerical information, which provides effective data for mathematical modeling, facilitates rapid modeling, and promotes model convergence, thereby saving solution costs and improving the efficiency of determining voice room quality categories.
  • In summary, this application recognizes the spoken text from the voice stream generated in a unit time period of the voice room, constructs a coding vector from the statistical features of the number of sound source objects, the statistical features of the total number of utterances, and the statistical features of the number of valid nouns in the spoken text, and then uses the deep semantic information of the coding vector to determine the quality category corresponding to the voice stream. Since the data used to construct the coding vector are various statistical features corresponding to the spoken text, rather than the original audio features or the original spoken text, the activity of the voice room can be represented with the help of the two statistical features of the number of sound source objects and the total number of utterances, and the content quality of the voice room can be represented with the help of the statistical features of nouns in the spoken text. The coding vector thus constituted achieves an effective preliminary representation of the voice stream, including its multi-modal information, so the quality category determined from its deep semantic information is more accurate and credible, and can provide scientific and reliable basic data for the platform to recommend voice rooms.
  • In one embodiment, step S1100, obtaining the voice stream in the voice room within a unit time period and recognizing the spoken text from the voice stream, includes the following steps:
  • Step S1110 Obtain the voice stream per unit time period generated in real time by the voice room;
  • Step S1120 Perform human voice detection on the voice stream to determine the human voice segments of different sound source objects.
  • Step S1130 Perform speech recognition on the human voice segments to obtain speech text corresponding to each human voice segment.
  • In step S1110, the voice stream generated in real time in the voice room can be collected in real time; using the length of the unit time period as the time unit, real-time analysis of the voice stream within the unit time period is started, which further improves the speed of determining the voice room quality category and reflects the real-time quality information of the voice room more quickly.
  • In step S1120, a VAD (Voice Activity Detection) statistical model is used to perform voice activity detection on the audio data in the voice stream, thereby removing silent information, and audio data whose VAD score exceeds a preset threshold is determined to be a human voice segment.
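  • The application relies on a VAD statistical model; as a simplified, assumed stand-in, the following sketch flags frames whose short-time energy exceeds a preset threshold and merges consecutive voiced frames into human-voice segments:

```python
# Energy-threshold VAD sketch (an assumption of this example, not the
# statistical model of the application): frames above the threshold are
# voiced; consecutive voiced frames are merged into segments.

def detect_voice_segments(samples, frame_len=160, threshold=0.01):
    """Return (start, end) sample indices of segments above the energy threshold."""
    segments, start = [], None
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(x * x for x in frame) / frame_len   # short-time energy
        if energy >= threshold:
            if start is None:
                start = i                                 # segment opens
        elif start is not None:
            segments.append((start, i))                   # segment closes
            start = None
    if start is not None:
        segments.append((start, len(samples)))
    return segments

# Toy signal: silence, a loud burst, then silence again.
sig = [0.0] * 320 + [0.5, -0.5] * 160 + [0.0] * 320
segs = detect_voice_segments(sig)
```
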
  • In step S1130, an ASR (Automatic Speech Recognition) model, for example the Wenet model, performs speech recognition on each human voice segment and converts it into text, thereby obtaining the spoken text corresponding to each human voice segment.
  • This embodiment performs real-time analysis on the voice stream generated in the voice room, can quickly obtain the corresponding spoken text, filters out most of the invalid information in the voice stream, greatly reduces the impact of environmental noise on the quality determination of the voice room, and makes the voice room quality classification process faster.
  • the step S1200 of constructing the encoding vector of the spoken text includes the following steps:
  • Step S1210 Obtain the number of sound source objects in the voice stream of the unit time period to form the corresponding statistical feature;
  • Step S1220 Obtain the total number of utterances in the voice stream of the unit time period to form the corresponding statistical feature;
  • Step S1230 Count the number of valid nouns in the spoken text according to multiple preset dimensions to form the corresponding statistical features;
  • Step S1240 Construct the statistical features into a coding vector in a preset order.
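  • The assembly of step S1240 can be sketched as follows; the feature names and their preset order are illustrative assumptions of this example:

```python
# Sketch of step S1240: assemble the statistical features into a coding
# vector in a fixed preset order. Names and order are illustrative.

FEATURE_ORDER = [
    "num_sources",        # number of sound source objects (S1210)
    "num_utterances",     # total number of utterances (S1220)
    "nouns_overall",      # valid-noun count, comprehensive dimension (S1230)
    "nouns_related",      # valid-noun counts per preset classification
    "nouns_commodity",
]

def build_encoding_vector(stats):
    """Arrange the statistical features into a vector in the preset order;
    a missing feature defaults to 0."""
    return [stats.get(name, 0) for name in FEATURE_ORDER]

vec = build_encoding_vector(
    {"num_sources": 8, "num_utterances": 120, "nouns_overall": 35,
     "nouns_related": 10, "nouns_commodity": 4}
)
```
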
  • In step S1210, the number of sound source objects may already be determined by the voice room service, in which case it can be obtained directly through an interface call, or it can be obtained by applying any feasible sound source separation technology to real-time analysis of the voice stream in the unit time period. In any case, the number of speaking users in the voice stream generated within a unit time period is fixed, so the number of corresponding sound source objects is also fixed, and it can be used as one of the statistical features to characterize the overall scale of speaking users in the voice room.
  • In step S1220, when the voice room service is responsible for storing the user behavior data corresponding to each utterance of each user in the voice room, the total number of utterances can be obtained by statistics over these user behavior data.
  • Alternatively, the total number of human voice segments can be directly taken as the total number of utterances, thus determining the total number of utterances in the voice stream of the unit time period, which can be used as one of the statistical features to characterize how actively users speak in the voice room within the unit time period.
  • In step S1230, any number of dimensions can be set to examine the number of valid nouns in the spoken text in different ways or at different granularities, and the count in each dimension can be used as a corresponding statistical feature, so as to characterize the information value of the spoken text in different ways or at different granularities.
  • For example, a basic noun table can be pre-populated with manually annotated basic nouns, and each noun in the spoken text is then matched against the basic noun table according to different matching methods. Whenever a matching basic noun is found, the count of valid nouns under that matching method is incremented by one unit. Each matching method corresponds to a dimension, thereby determining the number of valid nouns in each dimension.
  • The basic nouns can also be annotated in a more fine-grained manner: a preset classification is set for each basic noun according to a preset classification standard, and then the number of valid nouns in the spoken text that hit each preset classification is counted as a statistical feature of the corresponding subdivision granularity.
  • the classification criteria can be divided according to the information value of nouns and the recommendation purposes they serve.
  • For example, the preset classifications can be set to "common nouns", "related nouns", and "commodity nouns". Common nouns can correspond to general life nouns, such as "life", "poetry", and "distance"; related nouns can correspond to nouns related to the user's shopping needs, such as "deposit", "credit card", and "shopping mall"; commodity nouns can correspond to specific product names, such as "shirt", "mobile phone", and "computer". It can be seen that, based on different service purposes, corresponding classification standards can be formulated to set classifications for the basic nouns in the basic noun table, thereby providing more fine-grained information value annotation for the basic nouns.
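  • The per-classification counting described above can be sketched as follows, reusing the example words from the text; the table contents are otherwise illustrative:

```python
from collections import Counter

# Sketch of fine-grained counting: each basic noun carries a preset
# classification, and valid nouns hitting each class are tallied. The word
# lists mirror the examples in the text and are otherwise illustrative.

BASIC_NOUNS = {
    "life": "common", "poetry": "common", "distance": "common",
    "deposit": "related", "credit card": "related", "shopping mall": "related",
    "shirt": "commodity", "mobile phone": "commodity", "computer": "commodity",
}

def count_hits_by_class(valid_nouns):
    """Count how many valid nouns hit each preset classification."""
    hits = Counter()
    for noun in valid_nouns:
        if noun in BASIC_NOUNS:                 # exact hit in the basic table
            hits[BASIC_NOUNS[noun]] += 1
    return dict(hits)

hits = count_hits_by_class(["life", "shirt", "computer", "weather", "deposit"])
```
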
  • the first two methods can be flexibly combined as needed, and can be flexibly selected by those skilled in the art based on the principles disclosed here.
  • Since each basic noun is annotated in advance, it is given information value; when the basic nouns in the basic noun table are further annotated with preset classifications, classification information value is added, and the statistical features obtained in each dimension can effectively represent the information value of the voice stream within the unit time period from different information value perspectives.
  • each statistical feature can be constructed into a coding vector in a certain preset order.
  • The preset order can be determined according to the input requirements of the mathematical model used to solve the quality category mapped by the coding vector; this does not affect the inventive spirit of the present application, and those skilled in the art can determine it flexibly based on the principles disclosed here.
  • This embodiment exemplarily reveals the construction process of the coding vector. It can be seen that constructing the coding vector is also a process of preliminary representation of the information value of the voice stream in the voice room within a unit time period: by using multiple numerical statistical features, the information value of the voice stream is effectively represented, so that the coding vector provides the technical basis for solving its corresponding quality category and supplies important basic information for guiding the mathematical model to accurately solve the quality category of the voice room.
  • step S1230 is to obtain the number of nouns in the spoken text according to multiple preset dimensions to form corresponding statistical features, including the following steps:
  • Step S1231 Extract nouns in the spoken text to obtain a noun set
  • Step S1232 Filter the noun set according to the preset stop word list to obtain a valid noun set;
  • Step S1233 Determine the number of noun hits in the preset basic noun table in the valid noun set under each matching rule according to the corresponding matching rules provided in different preset dimensions, as the statistical characteristics of the corresponding dimension.
  • In step S1231, the full spoken text corresponding to the voice stream in the unit time period, obtained through speech recognition and text conversion, may contain some expressions of weak information value. Considering the fact that nouns play a relatively important role in language expression, the necessary natural language processing can be performed on the spoken text to obtain its nouns and construct a noun set.
  • In step S1232, in order to ensure the validity of the nouns in the noun set, text preprocessing can be performed on it, for example by referring to a preset stop word list to remove preset stop words such as "the", "is", "which", "who", and "ah"; the valid noun set is obtained after this purification.
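  • Step S1232 can be sketched as follows, using the stop words listed above:

```python
# Sketch of step S1232: purify the noun set against a preset stop word list.
# The stop words follow the examples given in the text.

STOP_WORDS = {"the", "is", "which", "who", "ah"}

def filter_valid_nouns(noun_set):
    """Drop preset stop words (case-insensitively) to obtain the valid noun set."""
    return {n for n in noun_set if n.lower() not in STOP_WORDS}

valid = filter_valid_nouns({"the", "mobile phone", "ah", "credit card"})
```
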
  • In step S1233, the corresponding matching rules can be determined according to the different preset dimensions; then, according to each matching rule, each noun in the valid noun set is matched against the basic nouns in the basic noun table, and the valid nouns that achieve a match are counted to determine the corresponding number of noun hits, which are used as the statistical features of the corresponding dimensions.
  • In this embodiment, a noun set is constructed by extracting the nouns in the spoken text, stop words are then filtered out, and the statistical features required for the coding vector are constructed from the filtered valid noun set. This improves the accuracy and effectiveness with which each statistical feature represents information value, allowing the coding vector to better guide the mathematical model in determining the quality category of the voice room.
  • In one embodiment, step S1231, extracting the nouns in the spoken text, includes the following steps:
  • Step S2311 Perform word segmentation on the spoken text to obtain a word segmentation set
  • Step S2312 Encode the word segments in the word segmentation set into embedding vectors
  • Step S2313 Extract deep semantic information from the embedding vector, perform part-of-speech recognition based on the deep semantic information, and determine the part-of-speech corresponding to each segmentation;
  • Step S2314 Extract word segments whose part-of-speech is nouns to construct the noun set.
  • In step S2311, word segmentation of the spoken text can be achieved using various statistical word segmentation algorithms; for example, using the N-Gram algorithm to perform binary or ternary segmentation of the spoken text yields the corresponding word segmentation set.
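  • The N-Gram segmentation mentioned above can be sketched as follows; English tokens are used here for readability, whereas the application would operate on Chinese text (typically character by character):

```python
# Sketch of N-Gram segmentation: binary (N=2) and ternary (N=3) segmentation
# of a token sequence into contiguous segments.

def ngram_segments(tokens, n):
    """Return all contiguous n-token segments, joined into single strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["buy", "a", "mobile", "phone"]
bigrams = ngram_segments(tokens, 2)   # binary segmentation
trigrams = ngram_segments(tokens, 3)  # ternary segmentation
```
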
  • In step S2312, any feasible vector encoding model such as Word2Vec can be used to encode each segment in the word segmentation set and convert it into a corresponding embedding vector.
  • In step S2313, part-of-speech recognition can be performed on each segment of the word segmentation set based on its embedding vector. Any feasible deep learning neural network model can be used, for example an architecture such as LSTM+CRF or Bert+CRF, in which the LSTM or Bert basic model performs representation learning on the embedding vector to obtain its deep semantic information, and then a CRF (Conditional Random Field) performs part-of-speech recognition on it, so that the part of speech corresponding to each segment can be determined.
  • the part-of-speech can be set according to the grammatical part-of-speech, such as: noun, adjective, adverb, pronoun, etc.
  • step S2314 in order to construct the noun set, the participles belonging to nouns are extracted from the word segmentation set and constructed into a noun set.
  • After this processing, the final noun set more accurately represents the information of the voice room.
  • In one embodiment, step S1233, determining the number of noun hits in the preset basic noun table within the valid noun set under each matching rule according to the matching rules provided for the different preset dimensions, as the statistical feature of the corresponding dimension, includes the following steps:
  • Step S2331 According to the exact matching rule, count the number of valid nouns in the valid noun set that exactly hit basic nouns in the basic noun table, as the statistical feature of the comprehensive dimension;
  • Step S2332 According to the preset classifications of the basic nouns in the basic noun table, count separately the number of noun hits corresponding to each preset classification under the exact matching rule, as the statistical features corresponding to each preset classification dimension;
  • Step S2333 According to the fuzzy matching rule, count the number of valid nouns in the valid noun set that do not exactly hit but fuzzily hit basic nouns in the basic noun table, as the statistical feature of the similarity dimension.
  • in step S2331, the valid noun set obtained according to the embodiments of this application serves as the basic data for characterizing the spoken text in each preset dimension, and different matching rules can be adapted to different dimensions. Accordingly, this step first matches each valid noun in the valid noun set against the basic nouns in the basic noun table under the exact matching rule, so as to determine how many valid nouns hit the basic noun table; this count serves as the statistical feature under the exact matching rule, representing the statistical feature determined from the comprehensive dimension.
  • each valid noun to be matched is compared for full equality with each basic noun in the basic noun table; when the two strings are identical, a match is confirmed, and the corresponding noun-hit count is incremented by one.
  • the basic noun table has been annotated in advance and carries corresponding information value. Therefore, from the comprehensive dimension, the more valid nouns match the basic noun table, the higher the overall information value of the valid noun set.
  • each basic noun in the basic noun table can be pre-classified according to a certain classification standard, thereby providing finer-grained classification information value for the basic nouns.
  • the valid nouns that hit the basic noun table are grouped according to the preset classifications, and the number of noun hits of the valid nouns falling into each preset classification can be used as the statistical feature corresponding to that preset classification dimension.
  • because the preset classifications indicate subdivision granularity, the statistical features determined under each preset classification dimension effectively represent the richness of the information value of each preset classification.
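The exact-match counting of steps S2331 and S2332 can be sketched as follows, assuming a toy basic noun table; the table contents and category names are illustrative, not from the specification.

```python
from collections import Counter

# Sketch of steps S2331-S2332: count exact hits of valid nouns against a
# pre-annotated basic noun table (comprehensive dimension), then break the
# hits down by each basic noun's preset category.

BASIC_NOUN_TABLE = {        # hypothetical: basic noun -> preset category
    "movie": "entertainment", "music": "entertainment",
    "math": "education", "history": "education",
}

def exact_match_features(valid_nouns):
    hits = [n for n in valid_nouns if n in BASIC_NOUN_TABLE]
    total = len(hits)                                   # comprehensive dimension
    per_category = Counter(BASIC_NOUN_TABLE[n] for n in hits)
    return total, per_category

total, per_cat = exact_match_features({"movie", "music", "math", "weather"})
# "weather" misses the table: it is left for the fuzzy-matching stage
```

A noun that misses the table contributes nothing here; the similarity dimension (step S2333) picks those up.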
  • for the valid nouns that did not exactly hit the basic noun table, the fuzzy matching rule can be further applied to match them against the basic nouns in the basic noun table again, so as to find the basic nouns matching those valid nouns as their synonyms; the total number of these synonyms, that is, the number of noun hits determined from the similarity dimension, is then counted as the corresponding statistical feature.
  • the fuzzy matching rule can perform wildcard matching using a traditional fuzzy rule-matching algorithm, or semantic matching using a deep-learning neural network model, and can be set flexibly by those skilled in the art. It is not difficult to see that among all the valid nouns that did not exactly hit the basic noun table, perhaps only a portion can be fuzzily matched against it. In any case, the number of synonyms finally determined, that is, the number of noun hits determined by fuzzy matching, can represent, by degree of noun similarity, the information value of this portion of the valid nouns in the valid noun set, thereby effectively expressing that information value in the form of the corresponding statistical feature.
  • the statistical features corresponding to the valid nouns in the spoken text can thus represent the corresponding information value, so that the subsequently obtained encoding vector can more accurately represent the valid information on which the quality category of the voice room is determined.
  • step S2333, counting, according to the fuzzy matching rule, the number of noun hits in which valid nouns that did not exactly hit nevertheless fuzzily hit basic nouns in the basic noun table, as the statistical feature of the similarity dimension, includes the following steps:
  • Step S3331 Obtain the redundant subset of valid nouns in the valid noun set that do not accurately hit the basic noun table;
  • Step S3332 Calculate the semantic similarity between the vector of each valid noun in the redundant subset and the vector of each basic noun in the basic noun table;
  • Step S3333: for each valid noun whose highest semantic similarity exceeds a preset threshold, count it toward the number of noun hits that fuzzily hit the basic noun table.
  • in step S3331, referring to the previous embodiment, after the valid noun set has been matched against the basic noun table under the exact matching rule, the valid nouns that did not exactly match the basic noun table can be determined; these valid nouns can be assembled into a redundant subset of the valid noun set to facilitate subsequent operations.
  • a text feature extraction model pre-trained to a converged state is then used to perform representation learning on each valid noun in the redundant subset and each basic noun in the basic noun table, obtaining for each a vector that represents its deep semantic information.
  • the text feature extraction model described above is implemented using a neural network model.
  • any basic network model suitable for extracting text features, such as Fasttext or Albert, can be used.
  • those skilled in the art can also attach a classifier as needed to fine-tune the model, so that it learns vectors corresponding to the deep semantic information that accurately represents the valid nouns and basic nouns.
  • in step S3332, based on the vector of each valid noun in the redundant subset, the semantic similarity between that vector and the vector of each basic noun in the basic noun table is calculated, yielding a similarity matrix. The value stored in each element represents the semantic similarity between the valid noun of its row and the basic noun of its column. Expressing the semantic similarities as a matrix makes the computation convenient and fast.
  • calculating the semantic similarity between two vectors can be implemented with any feasible data-distance algorithm, including but not limited to the cosine similarity, Euclidean distance, Pearson correlation coefficient, and Jaccard coefficient algorithms; any of them will do. After computation, the result is suitably normalized so that a larger value indicates more similar vectors, and the resulting semantic similarity value is stored in the similarity matrix.
  • for each valid noun, the semantic similarities to the basic nouns can be used to determine whether it matches one of them. In a specific method, a preset threshold serves as the criterion for whether a similarity qualifies as a match: for the basic noun corresponding to the element with the highest semantic similarity, that similarity value is compared with the preset threshold. When the former exceeds the latter, the two vectors are confirmed to match, that is, the valid noun matches the basic noun, and the noun-hit count of the similarity dimension is incremented by one; when the former does not exceed the latter, the two vectors are confirmed not to match. By this principle, it can be determined whether each valid noun achieves a fuzzy match with the basic noun table; the noun-hit count obtained after traversing all valid nouns in the similarity matrix is the statistical feature under the similarity dimension.
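The threshold-based fuzzy matching of steps S3331 to S3333 can be sketched as follows. The 2-d vectors are toy stand-ins for embeddings a converged text feature extraction model would produce, and the 0.9 threshold is an assumed value.

```python
import math

# Sketch of steps S3331-S3333: for valid nouns that missed the exact match
# (the "redundant subset"), compare their vectors against every basic-noun
# vector; a noun counts as a fuzzy hit when its best cosine similarity
# exceeds a preset threshold.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def fuzzy_hit_count(redundant_vecs, basic_vecs, threshold=0.9):
    """Number of redundant nouns whose best basic-noun match clears the threshold."""
    hits = 0
    for vec in redundant_vecs.values():
        best = max(cosine(vec, b) for b in basic_vecs.values())
        if best > threshold:
            hits += 1
    return hits

redundant = {"film": (1.0, 0.1), "rain": (0.0, 1.0)}   # toy embeddings
basic = {"movie": (1.0, 0.0), "math": (0.4, 0.1)}
n_hits = fuzzy_hit_count(redundant, basic)  # similarity-dimension feature
```

Here "film" sits close to "movie" and counts as a synonym hit, while "rain" matches nothing in the table, mirroring the observation that only part of the redundant subset achieves a fuzzy match.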
  • in this way, the valid nouns in the redundant subset are fuzzily matched against the basic nouns in the basic noun table to determine the number of corresponding synonyms, that is, the noun-hit count of the similarity dimension, as the corresponding statistical feature. On this basis, semantic similarity enables deeper data mining of the information value of the valid noun set and avoids missing important information, so that the corresponding statistical feature represents the value of synonymous information more scientifically and fully, guiding the subsequent determination of the voice room category toward more accurate results.
  • a neural network model based on deep learning can be used to determine the corresponding quality category of the encoding vector.
  • step S1300, determining the quality category of the voice room according to the encoding vector, is implemented using a neural network classification model pre-trained to a converged state.
  • the training process of the neural network classification model includes the following steps:
  • Step S4100 Call a single training sample in a preset data set.
  • the training sample includes the voice stream in a unit time period and the quality category marked for the voice stream;
  • Step S4200 Extract deep semantic information from the encoding vector corresponding to the speech stream of the training sample through a convolutional neural network
  • Step S4300 Classify and map the deep semantic information through a classifier to obtain a predicted quality category
  • Step S4400 Calculate the model loss value of the predicted quality category according to the annotated quality category.
  • Step S4500 Determine whether the model loss value reaches the preset threshold. When the model loss value does not reach the preset threshold, perform a gradient update on the model and call the next training sample to continue iterative training. Otherwise, determine that the model has converged and terminate the training.
  • the neural network classification model can use an ordinary convolutional neural network to perform representation learning on the input encoding vector, combined with a classifier that maps the representation learning result into a preset quality classification space. On this basis, a data set is prepared for training the neural network classification model until it converges.
  • the data set can be prepared by those skilled in the art by sampling the voice streams generated by the voice rooms of the live streaming platform according to the methods disclosed in the embodiments of this application, and manually annotating the corresponding quality categories to form the training samples of the data set. It is not difficult to see that, during sampling, the voice streams generated in different unit time periods of the same voice room can be collected to form different training samples. Usually the voice streams of the same voice room in different unit time periods represent different information value, so the corresponding annotated quality categories may also differ. In short, the quality category corresponding to the voice stream in a training sample, serving as the supervision label of the neural network classification model, can be determined by manual annotation based on the actual information value of the voice stream.
  • any training sample can be taken directly from the data set to obtain the voice stream and its annotated quality category; the former is used to construct the encoding vector required as the input of the classification model, while the latter is used to supervise the output of the classification model.
  • the encoding vector corresponding to the voice stream in the training sample can be constructed in the manner of any embodiment disclosed in this application; by keeping the construction of the encoding vector consistent between the training stage and the inference stage of the neural network classification model, its normal use can be ensured.
  • in step S4200, the convolutional neural network in the neural network classification model serves as the base model and is responsible for performing representation learning on the encoding vector constructed for the voice stream of the training sample, thereby extracting its deep semantic information.
  • in step S4300, the deep semantic information, after passing through a fully connected layer, enters the classifier and is mapped into the quality classification space: the classifier computes a classification probability for each quality category in the space, and the quality category with the largest classification probability is selected as the quality category of the encoding vector predicted by the model.
  • the quality classification space is preset for determining the voice quality level of the voice stream, and can be flexibly set by those skilled in the art, and will not be described in detail here.
  • in step S4400, the pre-annotated quality category in the training sample serves as the supervision label of the model output and is used, via a loss function, to calculate the model loss value of the quality category predicted by the model.
  • in step S4500, to drive the iterative training of the neural network classification model, a preset threshold is provided for the training of the classification model, and the model loss value produced for the training sample is compared with it. When the model loss value has not reached the preset threshold, backpropagation can be performed over the layers of the classification model according to the model loss value to correct their weights, realizing a gradient update of the classification model. When the model loss value reaches the preset threshold, the classification model has been trained to a converged state, so training can be terminated and the model put into practical use.
  • such a deep-learning neural network classification model, once trained to a converged state, is used to determine the quality category mapped from the encoding vector. Because the classification model can deeply understand the semantic correlations among the statistical features in the encoding vector and obtain the corresponding deep semantic information for classification mapping, it mines the encoding vector in depth for valid information value, and accurate quality-category determination results can be expected on that basis.
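The control flow of steps S4100 to S4500 can be sketched as below. A tiny hand-written logistic regression stands in for the convolutional network plus classifier, so only the loop structure (per-sample forward pass, loss computation, gradient update while the loss has not reached the preset threshold, then termination) follows the text; all data and hyperparameters are synthetic.

```python
import math

# Sketch of the training loop of steps S4100-S4500, with a toy model.
w, b, lr = [0.0, 0.0], 0.0, 0.5
dataset = [  # (encoding-vector stand-in, annotated quality label)
    ([3.0, 2.0], 1), ([2.5, 3.0], 1), ([0.2, 0.1], 0), ([0.4, 0.3], 0),
]
LOSS_THRESHOLD = 0.05  # the "preset threshold" of step S4500

def predict(x):
    z = w[0] * x[0] + w[1] * x[1] + b
    return 1.0 / (1.0 + math.exp(-z))

converged = False
for epoch in range(2000):
    losses = []
    for x, y in dataset:                 # S4100: call a training sample
        p = predict(x)                   # S4200/S4300: forward pass + mapping
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1 - p)))  # S4400
        g = p - y                        # gradient of the log loss
        w[0] -= lr * g * x[0]; w[1] -= lr * g * x[1]; b -= lr * g  # update
    if sum(losses) / len(losses) < LOSS_THRESHOLD:  # S4500: convergence test
        converged = True
        break
```

On this small separable data the loop stops well before the epoch cap, mirroring the "train until the loss reaches the preset threshold, then terminate" logic of step S4500.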
  • Step S5100 Respond to the voice room recommendation request submitted by the terminal device and determine multiple candidate voice rooms and their corresponding basic recommendation scores according to the preset recommendation algorithm;
  • Step S5200 According to the preset weight of the quality category corresponding to each candidate voice room, adjust the corresponding basic recommendation score to obtain a recommendation display score;
  • Step S5300 Sort each candidate voice room in reverse order according to the recommendation display score to obtain a voice room recommendation list
  • Step S5400 Respond to the voice room recommendation request and push the voice room recommendation list to the terminal device for display.
  • in step S5100, when a user of the live streaming platform needs to obtain the corresponding voice room recommendation list on a terminal device through the page shown in FIG. 11, the corresponding voice room recommendation request can be triggered by entering the page for the first time or by refreshing it.
  • after receiving the request, the voice room service can call the preset recommendation algorithm to determine multiple candidate voice rooms for the user, and determine the corresponding basic recommendation score for each candidate voice room based on the recommendation algorithm.
  • the recommendation algorithm can be implemented flexibly on demand by those skilled in the art according to preset recommendation business logic. For example, based on the tags of the voice rooms visited in the user's historical behavior data, tag matching can be performed against the massive voice rooms in the platform to match personalized candidate voice rooms for the user, with the corresponding basic recommendation score quantified according to the degree of tag match.
  • alternatively, the recommendation algorithm can be implemented with a twin-tower model, which takes the vectors of the tags of the voice rooms visited in the user's historical behavior data as one input and the vectors of the tags of all voice rooms in the platform as the other input, performs separate representation learning and semantic similarity matching to determine the corresponding semantic similarities, and then selects multiple voice rooms as the candidate voice rooms according to those similarities; the semantic similarity of each candidate voice room can serve as its basic recommendation score.
  • in step S5200, the quality category corresponding to each candidate voice room can be determined according to any of the preceding embodiments of this application. For each quality category of the quality classification system, a weight used to adjust the recommendation score is preset, such that the higher the quality of the information actually represented, the higher the weight, and vice versa, enabling quantitative evaluation of the different quality categories. For each candidate voice room, the basic recommendation score is multiplied by the preset weight of its quality category, and the product serves as its recommendation display score. Since the weights are quantified by quality category, the recommendation display score is essentially the result of correspondingly downgrading or upgrading the basic recommendation score.
  • in step S5300, after each candidate voice room has obtained its recommendation display score, the candidate voice rooms can be sorted in descending order of that score, so that rooms of better quality rank first; the final voice room recommendation list is obtained from this ordering.
  • in step S5400, the voice room recommendation list can be pushed to the terminal device that submitted the voice room recommendation request, completing the response to the request.
  • the voice room recommendation list can encapsulate various necessary information of the corresponding voice room, including but not limited to the access entrance link of the corresponding voice room, the introduction of the voice room, etc.
  • after obtaining the voice room recommendation list, the terminal device parses it and displays it in the graphical user interface, as shown in FIG. 11.
  • this embodiment exemplarily demonstrates how the quality-category identification capability implemented by this application serves the voice room recommendation business. It can be seen that when voice room quality categories are determined accurately and in a timely manner according to this application, the platform can recommend voice rooms to its users based on their information value, which effectively encourages users to stay on the platform and also drives traffic to high-quality voice rooms, optimizing the voice room recommendation logic of the entire platform; good economies of scale can be expected.
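The score adjustment and descending sort of steps S5200 and S5300 can be sketched as follows; the category names, weights, and room scores are illustrative assumptions, not values from the specification.

```python
# Sketch of the scoring stage of steps S5200-S5300: scale each candidate
# room's basic recommendation score by a preset weight for its quality
# category, then sort by the resulting display score in descending order.

QUALITY_WEIGHTS = {"high": 1.2, "normal": 1.0, "low": 0.5}  # hypothetical

def build_recommendation_list(candidates):
    """candidates: list of (room_id, basic_score, quality_category)."""
    scored = [
        (room_id, basic * QUALITY_WEIGHTS[category])
        for room_id, basic, category in candidates
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)

rooms = [("roomA", 0.80, "low"), ("roomB", 0.60, "high"), ("roomC", 0.55, "normal")]
ranking = build_recommendation_list(rooms)
# roomB: 0.72, roomC: 0.55, roomA: 0.40 -> the low-quality room drops last
```

Note how the weighting demotes a room with a high basic score but a low quality category, which is precisely the intended downgrading effect.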
  • a voice room quality assessment apparatus includes a speech recognition module 1100, a text encoding module 1200, and a quality identification module 1300, wherein: the speech recognition module 1100 is configured to obtain the voice stream of a voice room within a unit time period and recognize spoken text from the voice stream; the text encoding module 1200 is configured to construct an encoding vector of the spoken text, the encoding vector containing a statistical feature of the number of sound source objects in the voice stream, a statistical feature of the total number of utterances, and statistical features of the number of valid nouns in the spoken text; and the quality identification module 1300 is configured to determine the quality category of the voice room according to the encoding vector.
  • the speech recognition module 1100 includes: a segment processing sub-module, configured to obtain the voice stream of a unit time period generated in real time by the voice room; a human voice detection sub-module, configured to perform human voice detection on the voice stream to determine the human voice segments of different sound source objects; and a recognition conversion sub-module, configured to perform speech recognition on the human voice segments and obtain the spoken text corresponding to each human voice segment.
  • the text encoding module 1200 includes: a sound source statistics sub-module, configured to obtain the number of sound source objects in the voice stream of the unit time period to form the corresponding statistical feature; an utterance statistics sub-module, configured to obtain the total number of utterances in the voice stream of the unit time period to form the corresponding statistical feature; a noun statistics sub-module, configured to count the number of valid nouns in the spoken text according to multiple preset dimensions to form the corresponding statistical features; and an encoding construction sub-module, configured to assemble the respective statistical features into the encoding vector in a preset order.
  • the noun statistics sub-module includes: a noun extraction unit, configured to extract the nouns in the spoken text to obtain a noun set; a noun filtering unit, configured to filter the noun set against a preset stop-word table to obtain a valid noun set; and a matching statistics unit, configured to determine, according to the matching rules respectively provided for different preset dimensions, the number of noun hits of the valid noun set against the preset basic noun table under each matching rule, as the statistical feature of the corresponding dimension.
  • the noun extraction unit includes: a word segmentation subunit, configured to segment the spoken text to obtain a segmentation set; a vectorization subunit, configured to encode the segments in the segmentation set into embedding vectors; a part-of-speech recognition subunit, configured to extract deep semantic information from the embedding vectors, perform part-of-speech recognition based on the deep semantic information, and determine the part of speech corresponding to each segment; and a noun extraction subunit, configured to extract the segments whose part of speech is a noun and construct them into the noun set.
  • the matching statistics unit includes: an exact statistics secondary unit, configured to count, according to the exact matching rule, the number of noun hits in which valid nouns in the valid noun set exactly hit basic nouns in the basic noun table, as the statistical feature of the comprehensive dimension; a subdivision statistics secondary unit, configured to further count, according to the preset classification of the basic nouns in the basic noun table, the number of noun hits corresponding to each preset classification under the exact matching rule, as the statistical feature corresponding to each preset classification dimension; and a fuzzy statistics secondary unit, configured to count, according to the fuzzy matching rule, the number of noun hits in which valid nouns that were not exactly hit nevertheless fuzzily hit basic nouns in the basic noun table, as the statistical feature of the similarity dimension.
  • the fuzzy statistics secondary unit includes: a redundancy construction subunit, configured to obtain the redundant subset of valid nouns in the valid noun set that did not exactly hit the basic noun table; a similarity calculation subunit, configured to calculate the semantic similarity between the vector of each valid noun in the redundant subset and the vector of each basic noun in the basic noun table; and a filtering and counting subunit, configured to count, for the valid nouns whose highest semantic similarity exceeds a preset threshold, the number of noun hits that fuzzily hit the basic noun table.
  • the quality identification module 1300 is implemented using a neural network classification model that is pre-trained to a converged state.
  • the neural network classification model is trained to a converged state by a preset training apparatus performing the training task. The training apparatus includes: a sample calling module, configured to call a single training sample in a preset data set, the training sample including a voice stream of a unit time period and the quality category annotated for the voice stream; a semantic extraction module, configured to extract deep semantic information from the encoding vector corresponding to the voice stream of the training sample through a convolutional neural network; a classification mapping module, configured to perform classification mapping on the deep semantic information through a classifier to obtain a predicted quality category; a loss calculation module, configured to calculate the model loss value of the predicted quality category according to the annotated quality category; and an iterative decision module, configured to determine whether the model loss value reaches a preset threshold, perform a gradient update on the model and call the next training sample to continue iterative training when it does not, and otherwise determine that the model has converged and terminate training.
  • the quality identification module 1300 includes: a request response module, configured to respond to a voice room recommendation request submitted by a terminal device and determine multiple candidate voice rooms and their corresponding basic recommendation scores according to a preset recommendation algorithm; a score adjustment module, configured to adjust the corresponding basic recommendation score according to the preset weight of the quality category determined for each candidate voice room to obtain a recommendation display score; a sorting processing module, configured to sort the candidate voice rooms in descending order of recommendation display score to obtain a voice room recommendation list; and a response push module, configured to respond to the voice room recommendation request by pushing the voice room recommendation list to the terminal device for display.
  • FIG. 13 is a schematic diagram of the internal structure of the computer device.
  • the computer device includes a processor, a computer-readable storage medium, a memory, and a network interface connected through a system bus.
  • the computer-readable storage medium of the computer device stores an operating system, a database, and computer-readable instructions. The database can store a sequence of control information. When the computer-readable instructions are executed by the processor, the processor can implement a voice room quality assessment method.
  • the processor is configured to execute the specific functions of each module in Figure 12, and the memory stores program codes and various types of data required to execute the above modules or sub-modules.
  • the network interface is configured for data transmission with user terminals or servers.
  • the memory in this embodiment stores the program codes and data required to execute all modules in the voice room quality assessment apparatus of this application, and the server can call these program codes and data to execute the functions of all modules.
  • this application also provides a storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to execute the steps of the voice room quality assessment method of any embodiment of this application.
  • the present application also provides a computer program product, which includes a computer program/instruction that implements the steps of the method described in any embodiment of the present application when executed by one or more processors.
  • this application can accurately determine the quality category of the voice stream generated by the voice room, improve the accuracy of recommending voice rooms to platform users, help activate platform user traffic, and increase platform user retention rate.


Abstract

This application relates to the field of instant messaging technology and provides a voice room quality assessment method and a corresponding apparatus, device, medium, and product. The method includes: obtaining the voice stream of a voice room within a unit time period, and recognizing spoken text from the voice stream; constructing an encoding vector of the spoken text, the encoding vector containing a statistical feature of the number of sound source objects in the voice stream, a statistical feature of the total number of utterances, and statistical features of the number of valid nouns in the spoken text; and determining the quality category of the voice room according to the encoding vector.

Description

Voice Room Quality Assessment Method and Apparatus, Device, Medium, and Product Thereof

Technical Field

This application relates to the field of instant messaging technology, and in particular to a voice room quality assessment method and its apparatus, device, medium, and product.
Background

In online interaction scenarios, users of a live streaming platform can communicate instantly by voice, which has given rise to live rooms with real-time call capabilities, in particular dedicated voice rooms. Users in a voice room can pursue purposes such as topic discussion, talent showcasing, information sharing, and educational content, promoting overall social benefit.

A live streaming platform usually supports a massive number of voice rooms concurrently. Because the speech content of the speaking users varies enormously from room to room, the exhibited quality also differs. To recommend voice rooms to platform users, the platform can rely on voice room quality evaluation technology to help screen high-quality rooms.

Traditional voice room quality evaluation techniques either feed speech features into a preset model for recognition, or recognize based on the text obtained from speech-to-text conversion. In practice, neither evaluates well, mainly because the raw information, whether speech features or speech text, is mixed and scattered due to the complexity of user speech: excessive pauses, too many filler words, rambling content, heavy noise, and so on. All of these degrade the quality evaluation, so the accuracy of the identified high-quality voice rooms is low, which in turn harms the recommendation effect.
Summary

This application provides a voice room quality assessment method and a corresponding apparatus, voice room recognition device, computer-readable storage medium, and computer program product.
According to one aspect of this application, a voice room quality assessment method is provided, including the following steps:

obtaining the voice stream of a voice room within a unit time period, and recognizing spoken text from the voice stream;

constructing an encoding vector of the spoken text, the encoding vector containing a statistical feature of the number of sound source objects in the voice stream, a statistical feature of the total number of utterances, and statistical features of the number of valid nouns in the spoken text; and

determining the quality category of the voice room according to the encoding vector.

According to another aspect of this application, a voice room quality assessment apparatus is provided, including:

a speech recognition module, configured to obtain the voice stream of a voice room within a unit time period and recognize spoken text from the voice stream;

a text encoding module, configured to construct an encoding vector of the spoken text, the encoding vector containing a statistical feature of the number of sound source objects in the voice stream, a statistical feature of the total number of utterances, and statistical features of the number of valid nouns in the spoken text; and

a quality identification module, configured to determine the quality category of the voice room according to the encoding vector.

According to another aspect of this application, a voice room recognition device is provided, including a central processing unit and a memory, the central processing unit being configured to call and run a computer program stored in the memory to execute the steps of the voice room quality assessment method described in this application.

According to another aspect of this application, a computer-readable storage medium is provided, which stores, in the form of computer-readable instructions, a computer program implemented according to the voice room quality assessment method; when the computer program is called and run by a computer, the steps included in the method are executed.

According to another aspect of this application, a computer program product is provided, including a computer program/instructions which, when executed by a processor, implement the steps of the method described in any embodiment of this application.
Brief Description of the Drawings

FIG. 1 is a schematic diagram of the network architecture corresponding to the operating environment of the voice room to which this application is applied;

FIG. 2 is a schematic flowchart of an embodiment of the voice room quality assessment method of this application;

FIG. 3 is a schematic flowchart of the process of recognizing spoken text from a voice stream in an embodiment of this application;

FIG. 4 is a schematic flowchart of the process of constructing an encoding vector in an embodiment of this application;

FIG. 5 is a schematic flowchart of the process of obtaining statistical features from the word segmentation of the spoken text in an embodiment of this application;

FIG. 6 is a schematic flowchart of the process of segmenting the spoken text to obtain a segmentation set in an embodiment of this application;

FIG. 7 is a schematic flowchart of the process of determining the respective statistical features from the valid noun set of the spoken text in an embodiment of this application;

FIG. 8 is a schematic flowchart of the process of fuzzy-matching the redundant subset of the valid noun set and counting noun hits in an embodiment of this application;

FIG. 9 is a schematic flowchart of the training process of the neural network classification model used to determine the quality category mapped from the encoding vector in an embodiment of this application;

FIG. 10 is a schematic flowchart of the process of pushing a voice room recommendation list in response to a voice room recommendation request in an embodiment of this application;

FIG. 11 is an exemplary graphical user interface of this application for displaying a voice room recommendation list;

FIG. 12 is a functional block diagram of the voice room quality assessment apparatus of this application;

FIG. 13 is a schematic structural diagram of a voice room recognition device used in this application.
Detailed Description

Referring to the network architecture shown in FIG. 1, it can be configured to deploy the computer program products obtained by implementing the embodiments of this application so as to provide a voice room service, through which online voice rooms are constructed for the users in the rooms to interact online. It should be noted that a live room in traditional network live streaming, since it contains a voice stream, can also be regarded as a specific form of the voice room described in this application.

The application server 81 shown in FIG. 1 can support the implementation of the voice rooms, while the media server 82 can handle the forwarding of the voice streams of each voice room. Terminal devices such as the computer 83 and the mobile phone 84 act as clients and are generally provided to the users of the voice rooms; a graphical user interface is provided to the corresponding users through the front-end pages or applications matching the voice room service, so as to realize human-computer interaction.
Referring to Fig. 2, a voice room quality assessment method provided according to one aspect of the present application includes, in one embodiment, the following steps:
Step S1100: acquire the voice stream of a voice room within a unit time period, and recognize spoken text from the voice stream;
Step S1200: construct an encoding vector for the spoken text, the encoding vector containing a statistical feature of the number of audio-source objects in the voice stream, a statistical feature of the total number of utterances, and statistical features of the number of valid nouns in the spoken text; and
Step S1300: determine the quality class of the voice room according to the encoding vector.
In an exemplary application scenario of step S1100, the voice room service of a live-streaming platform runs a massive number of voice rooms concurrently. The voice data produced by each voice room are uploaded in a streaming-media format to the media server, which pushes the corresponding voice stream to the terminal devices of the receiving users of that voice room, thereby supporting instant communication in the voice room. The voice stream corresponding to the voice room can therefore be obtained from the media server.
To facilitate processing of the voice stream, a unit time period is set in advance, for example 20 or 30 minutes; those skilled in the art can flexibly choose a suitable duration, as long as the unit time period yields an appropriate amount of speech content. Each time the voice stream of a voice room is processed, the stream can be taken by looking back from the current moment over one unit time period; that is, the voice stream generated during each unit time period is processed once that period elapses. In this way, the voice stream continuously produced in the voice room is recognized in stages.
Then, any feasible speech recognition technique is applied to the voice stream of the unit time period to perform speech-to-text recognition, yielding the corresponding spoken text. The spoken text generally contains the utterances of the individual audio-source objects in the voice stream.
In step S1200, to obtain a preliminary representation of the overall quality of the spoken text, multiple statistical features of the voice stream of the unit time period can be used to construct the corresponding encoding vector. These statistical features include a feature characterized by the number of audio-source objects in the voice stream, a feature characterized by the total number of utterances in the voice stream, and features characterized by the valid nouns in the spoken text.
The number of audio-source objects is the total number of users who spoke effectively in the voice room during the unit time period. It can be obtained from the voice room service, for example by monitoring and counting each user's submission of audio data for a speaking action during the period, or by applying any feasible audio-source separation technique to the voice stream; those skilled in the art can implement this flexibly according to the principles disclosed here. Understandably, the more audio-source objects there are, the larger the speaking-user base of the voice room.
The total number of utterances is the total count of effective utterances in the voice room during the unit time period. Likewise, it can be obtained from the voice room service, for example by monitoring and counting the audio data submitted for each utterance during the period, or by applying any feasible voice detection technique to identify the voice segments of the various audio sources; those skilled in the art can implement this flexibly according to the principles disclosed here. Understandably, the more utterances, the more active the communication in the voice room.
The statistical features of the number of valid nouns in the spoken text are counts obtained by matching the nouns in the spoken text against nouns confirmed in advance as valid. A base noun table of manually annotated base nouns can be provided in advance, and the nouns in the spoken text can be matched against the base noun table in one or more ways to obtain one or more counts as the corresponding statistical features. Understandably, the more valid nouns the spoken text contains, the richer its information value.
Thus the statistical features of audio-source object count, total utterance count, and valid-noun count quantitatively characterize the speaking-user scale, speaking activity, and information value of the voice room; constructed into an encoding vector corresponding to the spoken text, they form a preliminary representation of the quality information of the spoken text.
In the present application, a quality classification space containing multiple quality classes is constructed in advance. The number of classes can be set as needed, for example three classes ("high", "medium", "low") or four classes ("brilliant", "high-quality", "ordinary", "vulgar"), and so on, as those skilled in the art may define. On this basis, the quality class mapped from the encoding vector can be determined in multiple ways.
In one way, a mathematical model can establish a numeric mapping from the statistical features of the encoding vector to the quality classes. For example, the statistical features are weighted and normalized to obtain a sum; the sum is matched against the threshold intervals preset for the quality classes, and the quality class whose threshold interval contains the sum is determined as the class mapped from the encoding vector, i.e. the quality class of the voice stream of the unit time period. This approach is computationally simple and light, which saves system overhead and improves response speed.
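The weighted-normalization-plus-threshold-interval mapping just described can be sketched as follows. The weights, the per-feature normalization cap, and the intervals are illustrative assumptions, not values given in the patent:

```python
# Minimal sketch: map an encoding vector to a quality class via a
# weighted, normalized sum matched against per-class threshold intervals.

def quality_class(features, weights, intervals):
    """features: raw statistics already scaled into [0, 1] (capped at 1.0).
    weights: per-feature weights summing to 1.
    intervals: {label: (low, high)} half-open ranges over the weighted sum."""
    score = sum(w * min(f, 1.0) for w, f in zip(weights, features))
    for label, (low, high) in intervals.items():
        if low <= score < high:
            return label
    return "low"  # fallback when no interval matches

# Assumed weights for: audio-source count, utterance count, valid-noun count.
WEIGHTS = [0.3, 0.3, 0.4]
INTERVALS = {"high": (0.7, 1.01), "medium": (0.4, 0.7), "low": (0.0, 0.4)}
```

A room with uniformly strong statistics lands in the "high" interval; uniformly weak statistics land in "low".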
In another way, based on conventional machine learning, a decision tree algorithm can be applied: a mathematical model built with any optimization algorithm such as ID3, CART, GBDT, or XGB is solved over the encoding vector to obtain the mapped quality class. A concrete example follows:
Let X=(x1, x2, …, x7), where xij is the feature of the j-th 20-minute period of the i-th voice room and yij is the label of the j-th 20-minute period of the i-th room: yij=2 indicates that the i-th room in its j-th 20-minute period is a high-quality voice room, yij=1 an ordinary-quality voice room, and yij=0 a low-quality voice room. Randomly shuffling the Xij yields the training set V=(Z1, Z2, …, Zm), where m is the number of samples and Qi is the label corresponding to Zi.
The optimization model built with the decision tree algorithm is thus as follows:

Here the XGB algorithm is used for solving, though other known algorithms can of course also be applied, as those skilled in the art may flexibly choose. Since every statistical feature in the encoding vector is generated from numeric values, solving the quality class of the voice room in this way is efficient, fast, and interpretable, helping reduce overall implementation cost.
The exemplary ID3 algorithm is a decision tree algorithm whose core principle is to select the splitting feature by information gain and then build the decision tree recursively.
The exemplary CART algorithm, short for Classification And Regression Trees, can, as its name suggests, be used for both classification and regression.
The exemplary GBDT algorithm, short for Gradient Boosting Decision Tree, is a decision-tree-based ensemble algorithm, where Gradient Boosting is one algorithm within the boosting family of ensemble methods, iterating new learners via gradient descent.
The exemplary XGB algorithm, also known as XGBoost, is an ensemble learning method using CART as its base classifier; owing to its excellent computational efficiency and prediction accuracy, it is widely applied in data-modeling competitions.
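The information-gain criterion at the core of ID3 mentioned above can be sketched as follows; `samples` and the discrete feature values are illustrative assumptions:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(samples, labels, feature_index):
    """Gain of splitting `samples` (lists of discrete feature values) on one feature:
    base entropy minus the size-weighted entropy of the resulting groups."""
    base = entropy(labels)
    groups = {}
    for x, y in zip(samples, labels):
        groups.setdefault(x[feature_index], []).append(y)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return base - remainder
```

ID3 picks, at each node, the feature with the largest gain; a feature that perfectly separates the labels yields a gain equal to the base entropy, while an uninformative feature yields a gain of zero.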
In yet another way, based on deep learning, a neural network model can serve as the base model to extract deep semantic information from the encoding vector, and a classifier can then map it into the quality classification space. From the classification probabilities obtained for the quality classes in that space, the class with the highest probability is taken, thereby determining the quality class of the voice stream of the unit time period. Naturally, the neural network model should be pre-trained to convergence by those skilled in the art using a sufficient number of training samples. The base model can be implemented with a CNN (Convolutional Neural Network), an RNN (Recurrent Neural Network), or the like, and the classifier can be built with the Softmax() function; those skilled in the art can select models flexibly according to the principles disclosed herein. This approach takes the semantic correlations among the statistical features into account and is suitable for providing large-scale service.
As the variety of mathematical models for solving the quality class mapped from the encoding vector shows, the information on which the present application constructs the encoding vector is entirely numeric, supplying effective data for mathematical modeling, facilitating fast modeling and model convergence, reducing problem-solving cost, and improving the efficiency of solving the voice room quality class.
According to the embodiments disclosed herein, the present application recognizes spoken text from the voice stream generated by a voice room in a unit time period, constructs an encoding vector from the statistical feature of the number of audio-source objects, the statistical feature of the total number of utterances, and the statistical features of the number of valid nouns in the spoken text, and then uses the deep semantic information of the encoding vector to determine the quality class of that voice stream. Because the data used to construct the encoding vector are the statistical features corresponding to the spoken text rather than raw audio features or raw text, the two statistics of audio-source count and utterance count can represent the activity level of the voice room, while the noun statistics of the spoken text can represent its content quality. The resulting encoding vector is an effective preliminary representation of the voice stream containing multimodal information, and the quality class determined from its deep semantic information is more accurate and trustworthy, providing a scientific and reliable data basis for the platform to recommend voice rooms.
Referring to Fig. 3, in a variant embodiment of the present application, step S1100 of acquiring the voice stream of a voice room within a unit time period and recognizing spoken text from the voice stream includes the following steps:
Step S1110: acquire the voice stream of a unit time period generated by the voice room in real time;
Step S1120: perform voice detection on the voice stream to determine the voice segments of its different audio-source objects; and
Step S1130: perform speech recognition on the voice segments to obtain the spoken text corresponding to each voice segment.
In step S1110, the voice stream generated by the voice room can be collected in real time, and real-time analysis can be started on the stream of each unit time period, so as to further speed up determination of the voice room's quality class and reflect the room's real-time quality information faster.
In step S1120, a VAD (Voice Activity Detection) statistical model is applied to the audio data of the voice stream to detect voice activity, thereby removing silence; audio data whose VAD score exceeds a preset threshold are determined to be voice segments, yielding a voice segment for each utterance. Since the voice room service usually separates the audio sources of the voice stream in advance, or the present application can itself perform audio-source separation with a source-separation algorithm, the voice segments can be determined per audio-source object.
Then, in step S1130, each voice segment is recognized with any feasible speech recognition model based on automatic speech recognition (ASR) technology, for example a Wenet model, and converted into text, yielding the spoken text corresponding to each voice segment.
By performing real-time speech analysis on the voice stream generated by the voice room, this embodiment quickly obtains the corresponding spoken text, filters out most of the invalid information in the voice stream, greatly reduces the influence of environmental noise on the quality determination, and makes the quality classification process faster.
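A toy energy-threshold detector, standing in for the statistical VAD model named above (which would use a trained model rather than raw frame energy), illustrates how voice segments are cut from a stream; the frame length and threshold are assumed values:

```python
def detect_voice_segments(samples, frame_len=160, threshold=0.01):
    """Return (start, end) sample indices of runs of frames whose mean
    energy exceeds `threshold`, merging adjacent active frames into segments."""
    segments = []
    active_start = None
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(s * s for s in frame) / len(frame)  # mean squared amplitude
        if energy >= threshold:
            if active_start is None:
                active_start = i  # segment begins
        elif active_start is not None:
            segments.append((active_start, i))  # segment ends at silence
            active_start = None
    if active_start is not None:
        segments.append((active_start, len(samples)))
    return segments
```

Each returned segment would then be handed to the ASR model for transcription.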
Referring to Fig. 4, in a variant embodiment of the present application, step S1200 of constructing the encoding vector of the spoken text includes the following steps:
Step S1210: take the number of audio-source objects in the voice stream of the unit time period as a corresponding statistical feature;
Step S1220: take the total number of utterances in the voice stream of the unit time period as a corresponding statistical feature;
Step S1230: count the number of valid nouns in the spoken text along multiple preset dimensions to form the corresponding statistical features; and
Step S1240: construct the statistical features into an encoding vector in a preset order.
In step S1210, the number of audio-source objects can be determined in advance by the voice room service and obtained directly through an interface call, or determined by real-time analysis of the voice stream of the unit time period with any feasible audio-source separation technique. Either way, the number of speaking users in the voice stream produced within the unit time period is definite, so the corresponding number of audio-source objects is also definite; as a statistical feature, it can characterize the overall scale of the speaking users in the voice room.
In step S1220, for the total number of utterances in the voice stream of the unit time period: in one way, when the voice room service stores the user behavior data corresponding to each utterance of each user in the voice room, the total can be counted from those behavior data; in another way, combining the VAD-based voice segment detection of the foregoing embodiment, the total number of voice segments can be taken directly as the total number of utterances. The total number of utterances so determined, taken as a statistical feature, can characterize how actively the users spoke in the voice room during that unit time period.
In step S1230, any number of dimensions can be set to examine the number of valid nouns in the spoken text from different approaches or granularities, taking the count under each dimension as a corresponding statistical feature, so as to characterize the information value of the spoken text from different approaches or granularities.
For example, in one way, a given base noun table can be consulted, the table pre-recording manually annotated nouns as base nouns. For each noun in the spoken text, a matching base noun is looked up in the base noun table under each of several matching methods; each time a matching base noun is found, the valid-noun count under that matching method is incremented by one unit, each matching method corresponding to one dimension, thereby determining the valid-noun counts under the different dimensions.
In another way, on top of the base noun table, the base nouns can be annotated at a finer granularity: according to a preset classification standard, each base noun is assigned a preset category, and the number of valid nouns in the spoken text hitting each preset category is counted as a statistical feature of the corresponding finer granularity.
The classification standard can, by way of example, be defined according to the information value of the nouns and the recommendation purpose they serve. In a classification standard devised to serve product recommendation, the preset categories might be "ordinary nouns", "related nouns", and "product nouns": ordinary nouns correspond to general everyday nouns, such as "life", "poetry", "faraway places"; related nouns correspond to nouns related to users' shopping needs, such as "deposit", "credit card", "shopping mall"; product nouns correspond to specific product names, such as "shirt", "mobile phone", "computer". Thus, for different service purposes, corresponding classification standards can be devised to assign categories to the base nouns of the base noun table, providing finer-grained information-value annotation for the base nouns.
In a variant, the two ways above can be combined flexibly as needed; those skilled in the art can choose according to the principles disclosed here.
Understandably, since each base noun in the base noun table used to determine the valid nouns of the spoken text is annotated in advance and thereby endowed with information value, and when the base nouns are further annotated with preset categories the classification value of the information is incorporated as well, the statistical features obtained under the various dimensions can effectively characterize the information value of the voice stream of the unit time period from different information-value angles.
Finally, after the multiple statistical features are obtained, they can be constructed into an encoding vector in a certain preset order. The preset order can be decided according to the input of the mathematical model that solves the quality class mapped from the encoding vector; this does not affect the inventive spirit of the present application, and those skilled in the art can determine it flexibly according to the principles disclosed here.
This embodiment illustrates the construction of the encoding vector. As can be seen, constructing the encoding vector is itself a preliminary representation of the information value of the voice stream of a unit time period in the voice room; by effectively representing that information value with multiple numeric statistical features, the encoding vector gains the technical basis for solving its corresponding quality class, providing important foundational information for guiding the mathematical model to solve the voice room's quality class accurately.
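The fixed-order concatenation of step S1240 can be sketched as follows; the dimension names in `order` are hypothetical placeholders for whatever preset dimensions are configured:

```python
def build_encoding_vector(num_sources, num_utterances, noun_stats, order):
    """Concatenate the statistical features in a fixed preset order.

    num_sources: number of audio-source objects in the unit time period.
    num_utterances: total number of utterances in the unit time period.
    noun_stats: dict mapping a dimension name to its valid-noun hit count.
    order: list of dimension names fixing the layout of the noun features."""
    vec = [float(num_sources), float(num_utterances)]
    vec.extend(float(noun_stats[name]) for name in order)
    return vec
```

Keeping `order` identical at training and inference time is what the text means by keeping encoding-vector construction consistent.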
Referring to Fig. 5, in a variant embodiment of the present application, step S1230 of counting the number of valid nouns in the spoken text along multiple preset dimensions to form the corresponding statistical features includes the following steps:
Step S1231: extract the nouns from the spoken text to obtain a noun set;
Step S1232: filter the noun set with a preset stopword table to obtain a valid noun set; and
Step S1233: according to the matching rules provided for the different preset dimensions, determine, for each matching rule, the number of noun hits of the valid noun set against a preset base noun table, as the statistical feature of the corresponding dimension.
In step S1231, the full spoken text corresponding to the voice stream of the unit time period, obtained by speech recognition and conversion to text, may contain some expressions of weak information value. Given the fact that nouns play a major role in linguistic expression, the spoken text can undergo the necessary natural language processing to obtain its nouns, which are constructed into a noun set.
In step S1232, to improve the validity of the nouns in the noun set, text preprocessing can be applied to it: for example, with reference to a preset stopword table, preset stopwords such as "the", "is", "which", "who", "ah" are removed for purification; the purified result is the valid noun set.
With reference to the disclosure of the previous embodiment, on the basis of the valid noun set, the matching rule corresponding to each different preset dimension can be determined; the nouns of the valid noun set are then matched against the base nouns of the base noun table under that rule, and the valid nouns that achieve a match are counted to determine the corresponding noun-hit count, which serves as the statistical feature of the corresponding dimension.
In this embodiment, by extracting the nouns of the spoken text to construct a noun set, filtering stopwords, and then constructing from the filtered valid noun set the statistical features of the spoken text required for the encoding vector, the precision and validity with which the statistical features represent information value are improved, so that the encoding vector can better guide the mathematical model in determining the voice room's quality class.
Referring to Fig. 6, in a variant embodiment of the present application, step S1231 of extracting the nouns of the spoken text includes the following steps:
Step S2311: segment the spoken text into tokens to obtain a token set;
Step S2312: encode the tokens of the token set as embedding vectors;
Step S2313: extract deep semantic information from the embedding vectors and perform part-of-speech recognition on it to determine the part of speech corresponding to each token; and
Step S2314: extract the tokens whose part of speech is noun and construct them into the noun set.
In step S2311, the spoken text can be segmented with any statistics-based word segmentation algorithm; by way of example, applying the N-gram algorithm to perform bigram or trigram segmentation of the spoken text yields the corresponding token set.
In step S2312, to facilitate semantic extraction from the token set for determining the part of speech of each token, any feasible vector-encoding model such as Word2Vec can encode each token of the token set, converting it into the corresponding embedding vector.
In step S2313, part-of-speech recognition can then be performed on the tokens of the token set on the basis of the embedding vectors. For the semantic recognition, any feasible deep-learning neural network model can be used, for example any model built on an LSTM+CRF or BERT+CRF architecture: the LSTM or BERT base model performs representation learning on the embedding vectors to obtain their deep semantic information, and the CRF (conditional random field) then performs part-of-speech recognition on it, partitioning the tokens by their parts of speech. The parts of speech can follow grammatical categories, e.g. noun, adjective, adverb, pronoun.
In step S2314, to construct the noun set, the tokens of the token set that are nouns are extracted and built into the noun set.
As this embodiment shows, for the spoken text corresponding to the voice stream of a unit time period, the noun set finally obtained through segmentation, encoding, part-of-speech recognition, and keyword extraction represents the value of the voice room's information content more precisely; determining the encoding vector on this basis lays a very solid data-mining foundation for guiding the mathematical model in solving the voice room's quality class.
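The extraction pipeline above can be sketched end-to-end with a toy part-of-speech lexicon standing in for the trained LSTM+CRF (or BERT+CRF) tagger; the lexicon, its entries, and the stopword list are purely hypothetical:

```python
# Hypothetical stand-in tagger: in the real system the tag comes from a
# trained sequence-labeling model, not a dictionary lookup.
POS_LEXICON = {"shirt": "noun", "buy": "verb", "ah": "noun", "phone": "noun"}
STOPWORDS = {"ah", "the", "is", "which", "who"}

def extract_nouns(tokens, lexicon):
    """Keep only the tokens tagged as nouns."""
    return [t for t in tokens if lexicon.get(t) == "noun"]

def filter_valid_nouns(nouns, stopwords):
    """Drop stopwords to obtain the valid noun set (step S1232)."""
    return [n for n in nouns if n not in stopwords]
```

Chaining the two functions turns a token set into the valid noun set used by the matching rules of step S1233.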
Referring to Fig. 7, in a variant embodiment of the present application, step S1233 of determining, for each matching rule provided by the different preset dimensions, the number of noun hits of the valid noun set against the preset base noun table as the statistical feature of the corresponding dimension, includes the following steps:
Step S2331: according to an exact matching rule, count the noun hits of valid nouns in the valid noun set that exactly hit base nouns in the base noun table, as the statistical feature of a comprehensive dimension;
Step S2332: according to the preset categories of the base nouns in the base noun table, count separately, under the exact matching rule, the noun hits exactly hitting each preset category, as the statistical features of the respective preset-category dimensions; and
Step S2333: according to a fuzzy matching rule, count the noun hits of valid nouns in the valid noun set that do not exactly hit, but fuzzily hit, base nouns in the base noun table, as the statistical feature of a similarity dimension.
In step S2331, the valid noun set obtained according to the embodiments of the present application serves as the base data for constructing the spoken text's features under the various preset dimensions, with each different dimension fitted to a different matching rule. Accordingly, this step first matches, under the exact matching rule, each valid noun of the valid noun set against the base nouns of the base noun table to determine how many valid nouns hit the table, taking the count as the statistical feature under the exact matching rule, i.e. the statistical feature determined from the comprehensive dimension.
When the exact matching rule is applied, each valid noun to be matched is compared for full equality with each base noun of the base noun table; when the two strings are identical, a match is confirmed and the corresponding noun-hit count is incremented by one unit. Since, as noted above, the base noun table carries corresponding information value through prior annotation, in the comprehensive dimension, the more valid nouns match the base noun table, the higher the overall information value of the valid noun set.
In step S2332, as disclosed in the foregoing embodiments of the present application, each base noun of the base noun table can be pre-assigned a category according to a certain classification standard, providing finer-grained classification value for that base noun. Accordingly, likewise under the exact matching rule, the valid nouns that hit the base noun table are tallied by preset category, yielding the number of noun hits of the valid nouns on each preset category, which can serve as the statistical features of the respective preset-category dimensions.
Since the preset categories carry a finer-granularity indication, the statistical features determined under each preset-category dimension effectively characterize the richness of the information value of each preset category.
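The exact-match counting of steps S2331 and S2332 can be sketched as follows, representing the base noun table as a dict from base noun to preset category (a hypothetical representation; the category names mirror the product-recommendation example above):

```python
def exact_hit_count(valid_nouns, base_categories):
    """Comprehensive dimension: how many valid nouns exactly equal a base noun."""
    return sum(1 for n in valid_nouns if n in base_categories)

def category_hit_counts(valid_nouns, base_categories):
    """Preset-category dimensions: tally exact hits per category of the base noun."""
    counts = {}
    for n in valid_nouns:
        cat = base_categories.get(n)
        if cat is not None:
            counts[cat] = counts.get(cat, 0) + 1
    return counts

# Assumed base noun table with categories.
BASE = {"shirt": "product", "credit card": "related", "life": "ordinary"}
```

Repeated occurrences of the same noun each increment the count, matching the "increment by one unit per match" rule in the text.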
In step S2333, for the valid nouns of the valid noun set that did not exactly hit the base noun table under the exact matching rule, a fuzzy matching rule can further be applied to match them once more against the base nouns of the base noun table, so as to find in the base noun table the base nouns corresponding to these valid nouns as their synonyms; the total number of these synonyms, i.e. the noun-hit count determined from the similarity dimension, then serves as the corresponding statistical feature.
The fuzzy matching rule can use a conventional fuzzy rule-matching algorithm for wildcard-style matching, or a deep-learning neural network model for semantic matching, as those skilled in the art may flexibly choose. Understandably, of all the valid nouns that did not exactly hit the base noun table, perhaps only a portion can be fuzzily matched against it; in any case, the finally determined synonym count, i.e. the noun-hit count determined by fuzzy matching, can characterize, by closeness of nouns, the information value of that portion of valid nouns, effectively representing this information value in the form of the corresponding statistical feature.
As the embodiments disclosed here show, when determining the statistical features from the valid noun set of the spoken text, not only exact hits but also fuzzy hits on the base noun table are considered, and not only the comprehensive picture of exact hits but also the specifics of exact hits on each preset category of the base noun table are considered, extracting statistical features from different dimensions and different aspects. These statistical features, corresponding to the valid nouns of the spoken text, can represent the corresponding information value, so that the subsequently obtained encoding vector can more precisely represent the effective information by which the voice room's quality class is determined.
Referring to Fig. 8, in a variant embodiment of the present application, step S2333 of counting, according to the fuzzy matching rule, the noun hits of valid nouns that do not exactly hit, but fuzzily hit, base nouns in the base noun table, as the statistical feature of the similarity dimension, includes the following steps:
Step S3331: take the valid nouns of the valid noun set that did not exactly hit the base noun table to form a redundant subset;
Step S3332: compute the pairwise semantic similarity between the vector of each valid noun in the redundant subset and the vector of each base noun in the base noun table; and
Step S3333: count the valid nouns whose highest semantic similarity exceeds a preset threshold, obtaining the number of fuzzy noun hits on the base noun table.
In step S3331, with reference to the previous embodiment, once the valid noun set has been matched against the base noun table under the exact matching rule, the valid nouns that failed to match the table exactly can be identified and separately constructed into a redundant subset of the valid noun set, for convenience of subsequent computation.
In step S3332, this embodiment uses a text feature extraction model pre-trained to convergence to perform representation learning on each valid noun of the redundant subset and on each base noun of the base noun table, obtaining vectors that represent their deep semantic information. The text feature extraction model is implemented with a neural network model; any base network model suited to extracting text features, such as FastText or ALBERT, can be used. Those skilled in the art can also attach a classifier as needed for fine-tuning, so that the model learns vectors that precisely represent the deep semantic information of the valid nouns and base nouns.
In step S3333, then, based on the vector of each valid noun in the redundant subset, the semantic similarity between that vector and the vector of each base noun in the base noun table is computed, yielding a similarity matrix in which each element stores the semantic similarity between the valid noun of its row and the base noun of its column; representing the semantic similarities as a matrix facilitates fast computation.
The pairwise semantic similarity between vectors can be computed with any feasible data-distance algorithm, including but not limited to the cosine similarity algorithm, the Euclidean distance algorithm, the Pearson correlation coefficient algorithm, or the Jaccard coefficient algorithm. After computation, the results are normalized appropriately so that a larger value indicates two more similar vectors, and the resulting semantic similarity values are stored in the similarity matrix.
In the similarity matrix, for each valid noun, its semantic similarities to the base nouns can be used to judge whether it matches one of the base nouns. Concretely, a preset threshold can be provided to gauge whether a similarity meets the matching bar; then, for the base noun corresponding to the element with the highest semantic similarity value, that similarity value is compared with the preset threshold. When the former exceeds the latter, the two vectors are deemed to match, i.e. the valid noun matches that base noun, and the noun-hit count of the similarity dimension is incremented by one unit; when the former does not exceed the latter, the two vectors are deemed not to match. Whether each valid noun fuzzily matches the base noun table is determined by this same principle; the noun-hit count obtained after traversing all valid nouns of the similarity matrix is the statistical feature of the similarity dimension.
As can be understood from the embodiments disclosed here, when determining the statistical feature of the spoken text in the similarity dimension, the valid nouns that did not exactly hit the base noun table are further fuzzily matched, on the basis of semantic similarity, against the base nouns of the base noun table, thereby determining the number of corresponding synonyms, i.e. the noun-hit count of the similarity dimension, as the corresponding statistical feature. In this way, semantic similarity enables deeper data mining of the information value of the nouns in the valid noun set, avoids missing important information, and makes the corresponding statistical feature represent synonymous information value more scientifically and fully, guiding the subsequent voice room class determination toward a more accurate result.
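The threshold test over the similarity matrix can be sketched with cosine similarity; the noun vectors are assumed to come precomputed from the text feature extraction model, and the 0.8 threshold is an assumed value:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors; 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def fuzzy_hit_count(residual_vecs, base_vecs, threshold=0.8):
    """Count redundant-subset nouns whose best similarity to any base noun
    exceeds the threshold (one hit per noun, regardless of how many bases match)."""
    hits = 0
    for u in residual_vecs:
        best = max((cosine(u, v) for v in base_vecs), default=0.0)
        if best > threshold:
            hits += 1
    return hits
```

Only the single best-matching base noun per valid noun is tested against the threshold, mirroring the "highest semantic similarity" rule in step S3333.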
Referring to Fig. 9, in a variant embodiment of the present application, a deep-learning neural network model can be used to determine the quality class corresponding to the encoding vector. To this end, step S1300 of determining the quality class of the voice room according to the encoding vector is implemented with a neural network classification model pre-trained to convergence, whose training process includes the following steps:
Step S4100: invoke a single training sample from a preset dataset, the training sample including the voice stream of a unit time period and the quality class annotated for that voice stream;
Step S4200: extract, via a convolutional neural network, deep semantic information from the encoding vector corresponding to the training sample's voice stream;
Step S4300: perform classification mapping on the deep semantic information through a classifier to obtain a predicted quality class;
Step S4400: compute the model loss of the predicted quality class against the annotated quality class; and
Step S4500: judge whether the model loss has reached a preset threshold; when it has not, apply a gradient update to the model and invoke the next training sample to continue iterative training; otherwise, deem the model converged and terminate training.
In step S4100, by way of example, the neural network classification model can use an ordinary convolutional neural network to perform representation learning on the input encoding vector, combined with a classifier that maps the representation-learning result into the preset quality classification space. Accordingly, a dataset is prepared for training this neural network classification model to convergence.
The dataset can be built by those skilled in the art, following the ways disclosed in the embodiments of the present application, by sampling from the voice streams produced by the live-streaming platform's voice rooms and manually annotating their corresponding quality classes to form the training samples of the dataset. Understandably, when sampling, voice streams produced by the same voice room in different unit time periods can be collected as different training samples; generally, the information value represented by a room's voice stream differs between unit time periods, so the quality classes annotated for them can differ as well. In short, the quality class in a training sample, serving as the supervision label of the neural network classification model for the voice stream, can be determined by manual annotation according to the stream's actual information value.
When one round of training of the neural network classification model is performed, any training sample can be taken directly from the dataset, yielding its voice stream and its annotated quality class; the former is used to construct the encoding vector required as the classification model's input, the latter to supervise the classification model's output.
The encoding vector for the training sample's voice stream can be constructed in the way of any embodiment disclosed in the present application; in short, as long as the neural network classification model keeps encoding-vector construction consistent between the training phase and the inference phase, its normal use is ensured.
In step S4200, as noted above, the convolutional neural network in the neural network classification model, acting as the base model, performs representation learning on the encoding vector constructed for the training sample's voice stream, thereby extracting its deep semantic information.
In step S4300, the deep semantic information then passes through a fully connected layer into the classifier and is mapped into the quality classification space, predicting the classification probability corresponding to each quality class in the space; the quality class with the highest probability is taken as the class the model predicts for the encoding vector. The quality classification space, as noted above, is preset for grading the speech quality of voice streams and can be defined flexibly by those skilled in the art, so no more is said of it here.
In step S4400, the quality class pre-annotated in the training sample serves as the supervision label of the model output for computing the model loss corresponding to the predicted quality class; given the use of a classifier, a cross-entropy loss function can compute the model loss.
In step S4500, to control the iterative training process of the neural network classification model, a preset threshold is provided for its training; the model loss produced for the training sample is then compared with that threshold. When the model loss has not reached the preset threshold, back-propagation is applied through the components of the classification model according to the model loss to correct their weights, performing a gradient update of the classification model. When the model loss reaches the preset threshold, the classification model has been trained to convergence, its training can be terminated, and it can be put into service.
As the embodiments here show, with a neural network classification model implemented through deep learning and trained to convergence, and then used to determine the quality class mapped from the encoding vector, the classification model can deeply understand the semantic correlations among the statistical features of the encoding vector and perform classification mapping on the resulting deep semantic information; it thus mines the encoding vector deeply for effective information value, and an accurate quality-class determination can accordingly be expected.
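The training control flow described above (forward pass, cross-entropy loss, gradient update) can be sketched with a minimal pure-Python softmax classifier standing in for the CNN-plus-classifier; the data, learning rate, and epoch count are illustrative assumptions:

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def train(samples, labels, n_classes, epochs=200, lr=0.1):
    """Train a linear softmax classifier on encoding vectors with
    cross-entropy loss and per-sample gradient descent."""
    dim = len(samples[0])
    W = [[0.0] * dim for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            logits = [sum(w * xi for w, xi in zip(W[c], x)) + b[c]
                      for c in range(n_classes)]
            p = softmax(logits)
            for c in range(n_classes):
                g = p[c] - (1.0 if c == y else 0.0)  # d(cross-entropy)/d(logit)
                for j in range(dim):
                    W[c][j] -= lr * g * x[j]
                b[c] -= lr * g
    return W, b

def predict(W, b, x):
    """Return the class with the highest score, as in step S4300."""
    scores = [sum(w * xi for w, xi in zip(W[c], x)) + b[c] for c in range(len(W))]
    return scores.index(max(scores))
```

A fixed epoch budget stands in here for the loss-threshold stopping criterion of step S4500; the real system would additionally monitor the loss value against the preset threshold.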
Referring to Fig. 10, in a variant embodiment of the present application, after step S1300 of determining the quality class of the voice room according to the encoding vector, the method includes the following steps:
Step S5100: in response to a voice room recommendation request submitted by a terminal device, determine multiple candidate voice rooms and their corresponding base recommendation scores according to a preset recommendation algorithm;
Step S5200: adjust the base recommendation score of each candidate voice room according to the preset weight of the quality class determined for it, obtaining a recommendation display score;
Step S5300: sort the candidate voice rooms in descending order of recommendation display score to obtain a voice room recommendation list; and
Step S5400: answer the voice room recommendation request by pushing the voice room recommendation list to the terminal device for display.
In an exemplary application scenario of step S5100, when a user of the live-streaming platform needs to obtain a voice room recommendation list on a terminal device through a page such as that shown in Fig. 11, the corresponding voice room recommendation request can be triggered by entering the page for the first time or by refreshing it. On receiving the request, the voice room service can invoke a preset recommendation algorithm to determine multiple candidate voice rooms for the user, and determine each candidate's corresponding base recommendation score according to that algorithm.
The recommendation algorithm can be implemented flexibly by those skilled in the art according to preset recommendation business logic. For example, the labels of the voice rooms visited in the user's historical behavior data can be matched against the labels of the platform's massive number of voice rooms to find personalized candidate voice rooms for the user, and the degree of label match can be quantified into the corresponding base recommendation score.
In one implementation, the recommendation algorithm can be realized with a two-tower model: the vectors of the labels of the voice rooms visited in the user's historical behavior data form one input, and the vectors of the labels of all the platform's voice rooms form the other; each side undergoes representation learning, followed by semantic similarity matching to determine the corresponding semantic similarities. Multiple voice rooms are then selected by similarity as the candidate voice rooms, and each candidate's semantic similarity can serve as its base recommendation score.
In step S5200, each candidate voice room can have its quality class determined by applying any one of the foregoing embodiments of the present application. To reflect the information value of the quality class, a weight for adjusting the recommendation display score can be preset for each quality class of the quality classification system, such that the higher the information quality actually represented, the higher the weight, and the lower the information quality, the lower the weight, thereby quantifying the evaluation of the different quality classes.
For each candidate voice room, the preset weight of its quality class is multiplied by its base recommendation score, and the resulting product can serve as its recommendation display score. Since the weights are already quantified per quality class, the recommendation display score is in effect the base recommendation score down-weighted or up-weighted accordingly.
In step S5300, once every candidate voice room has obtained its corresponding recommendation display score, the candidates can be sorted in descending order of that score so that rooms of better quality rank higher, and the final voice room recommendation list is obtained from the sorted result.
In step S5400, the voice room recommendation list can be pushed to the terminal device that submitted the voice room recommendation request, completing the answer to that request. The voice room recommendation list can encapsulate the necessary items of information of each voice room, including but not limited to the room's access entry link and a brief introduction. After obtaining the list, the terminal device parses it and displays it in the graphical user interface, as shown in Fig. 11.
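The re-ranking of steps S5200 and S5300 can be sketched as follows; the per-class weight values are illustrative assumptions, not values from the patent:

```python
# Assumed per-quality-class weights used to adjust the base score.
QUALITY_WEIGHTS = {"high": 1.2, "ordinary": 1.0, "low": 0.6}

def rerank(candidates, weights):
    """candidates: list of (room_id, base_score, quality_label) tuples.
    Returns room ids sorted by base_score * class weight, descending."""
    scored = [(rid, score * weights[quality]) for rid, score, quality in candidates]
    scored.sort(key=lambda t: t[1], reverse=True)
    return [rid for rid, _ in scored]
```

A lower-scored candidate from a high-quality room can thereby overtake a higher-scored candidate from a low-quality room.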
The embodiment here exemplifies how the quality-class recognition capability realized by the present application serves the voice room recommendation business. As can be seen, with voice room quality classes determined accurately and promptly according to the present application, the platform can recommend voice rooms to its users on the merits of their information value, effectively keeping users on the platform while directing traffic to high-quality voice rooms, optimizing the platform's entire voice room recommendation logic, and promising good economies of scale.
Referring to Fig. 12, a voice room quality assessment apparatus provided according to one aspect of the present application includes a speech recognition module 1100, a text encoding module 1200, and a quality recognition module 1300, wherein: the speech recognition module 1100 is configured to acquire the voice stream of a voice room within a unit time period and recognize spoken text from the voice stream; the text encoding module 1200 is configured to construct an encoding vector for the spoken text, the encoding vector containing a statistical feature of the number of audio-source objects in the voice stream, a statistical feature of the total number of utterances, and statistical features of the number of valid nouns in the spoken text; the quality recognition module 1300 is configured to determine the quality class of the voice room according to the encoding vector.
In a variant embodiment of the present application, the speech recognition module 1100 includes: a segmentation processing submodule configured to acquire the voice stream of a unit time period generated by the voice room in real time; a voice detection submodule configured to perform voice detection on the voice stream and determine the voice segments of its different audio-source objects; and a recognition and conversion submodule configured to perform speech recognition on the voice segments and obtain the spoken text corresponding to each voice segment.
In a variant embodiment of the present application, the text encoding module 1200 includes: an audio-source statistics submodule configured to take the number of audio-source objects in the voice stream of the unit time period as a corresponding statistical feature; an utterance statistics submodule configured to take the total number of utterances in the voice stream of the unit time period as a corresponding statistical feature; a noun statistics submodule configured to count the number of valid nouns in the spoken text along multiple preset dimensions to form the corresponding statistical features; and an encoding construction submodule configured to construct the statistical features into an encoding vector in a preset order.
In a variant embodiment of the present application, the noun statistics submodule includes: a noun extraction unit configured to extract the nouns of the spoken text to obtain a noun set; a noun filtering unit configured to filter the noun set with a preset stopword table to obtain a valid noun set; and a match statistics unit configured to determine, according to the matching rules provided for the different preset dimensions, the number of noun hits of the valid noun set against a preset base noun table under each matching rule, as the statistical feature of the corresponding dimension.
In a variant embodiment of the present application, the noun extraction unit includes: a segmentation subunit configured to segment the spoken text to obtain a token set; a vectorization subunit configured to encode the tokens of the token set as embedding vectors; a part-of-speech recognition subunit configured to extract deep semantic information from the embedding vectors and perform part-of-speech recognition on it to determine the part of speech of each token; and a noun extraction subunit configured to extract the tokens whose part of speech is noun and construct them into the noun set.
In a variant embodiment of the present application, the match statistics unit includes: an exact statistics secondary unit configured to count, under an exact matching rule, the noun hits of valid nouns in the valid noun set exactly hitting base nouns in the base noun table, as the statistical feature of a comprehensive dimension; a subdivision statistics secondary unit configured to count separately, under the exact matching rule and according to the preset categories of the base nouns in the base noun table, the noun hits exactly hitting each preset category, as the statistical features of the respective preset-category dimensions; and a fuzzy statistics secondary unit configured to count, under a fuzzy matching rule, the noun hits of valid nouns in the valid noun set not exactly hitting, but fuzzily hitting, base nouns in the base noun table, as the statistical feature of a similarity dimension.
In a variant embodiment of the present application, the fuzzy statistics secondary unit includes: a redundancy construction subunit configured to take the valid nouns of the valid noun set not exactly hitting the base noun table to form a redundant subset; a similarity computation subunit configured to compute the pairwise semantic similarity between the vector of each valid noun in the redundant subset and the vector of each base noun in the base noun table; and a screening and counting subunit configured to count the valid nouns whose highest semantic similarity exceeds a preset threshold, obtaining the number of fuzzy noun hits on the base noun table.
In a variant embodiment of the present application, the quality recognition module 1300 is implemented with a neural network classification model pre-trained to convergence, the model being trained to convergence by a preset training apparatus including: a sample invocation module configured to invoke a single training sample from a preset dataset, the training sample including the voice stream of a unit time period and the quality class annotated for that voice stream; a semantic extraction module configured to extract, via a convolutional neural network, deep semantic information from the encoding vector corresponding to the training sample's voice stream; a classification mapping module configured to perform classification mapping on the deep semantic information through a classifier to obtain a predicted quality class; a loss computation module configured to compute the model loss of the predicted quality class against the annotated quality class; and an iteration decision module configured to judge whether the model loss has reached a preset threshold, apply a gradient update to the model and invoke the next training sample for continued iterative training when it has not, and otherwise deem the model converged and terminate training.
In a variant embodiment of the present application, following the quality recognition module 1300, the apparatus includes: a request response module configured to respond to a voice room recommendation request submitted by a terminal device by determining multiple candidate voice rooms and their corresponding base recommendation scores according to a preset recommendation algorithm; a score adjustment module configured to adjust the base recommendation score of each candidate voice room according to the preset weight of the quality class determined for it, obtaining a recommendation display score; a sorting module configured to sort the candidate voice rooms in descending order of recommendation display score to obtain a voice room recommendation list; and an answer-and-push module configured to answer the voice room recommendation request by pushing the voice room recommendation list to the terminal device for display.
Another embodiment of the present application further provides a voice room recognition device, which can be implemented as a computer device. Fig. 13 shows a schematic diagram of the internal structure of the computer device, which includes a processor, a computer-readable storage medium, a memory, and a network interface connected through a system bus. The computer-readable storage medium of the computer device stores an operating system, a database, and computer-readable instructions; the database can store sequences of control information, and when the computer-readable instructions are executed by the processor, the processor can implement a voice room quality assessment method.
In this implementation, the processor is configured to execute the concrete functions of the modules of Fig. 12, and the memory stores the program code and various data required for executing those modules or submodules. The network interface is configured for data transmission to and from user terminals or servers. The memory in this implementation stores the program code and data required for executing all modules of the voice room quality assessment apparatus of the present application, and the server can invoke that program code and data to execute the functions of all the modules.
The present application further provides a storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the voice room quality assessment method of any embodiment of the present application.
The present application further provides a computer program product, including a computer program/instructions which, when executed by one or more processors, implement the steps of the method of any embodiment of the present application.
In summary, the present application can accurately determine the quality class of the voice stream produced by a voice room, improve the accuracy of voice room recommendation for platform users, help activate platform user traffic, and raise platform user retention.

Claims (11)

  1. A voice room quality assessment method, comprising the following steps:
    acquiring the voice stream of a voice room within a unit time period, and recognizing spoken text from the voice stream;
    constructing an encoding vector for the spoken text, the encoding vector containing a statistical feature of the number of audio-source objects in the voice stream, a statistical feature of the total number of utterances, and statistical features of the number of valid nouns in the spoken text;
    determining the quality class of the voice room according to the encoding vector.
  2. The voice room quality assessment method according to claim 1, wherein constructing the encoding vector for the spoken text comprises the following steps:
    taking the number of audio-source objects in the voice stream of the unit time period as a corresponding statistical feature;
    taking the total number of utterances in the voice stream of the unit time period as a corresponding statistical feature;
    counting the number of valid nouns in the spoken text along multiple preset dimensions to form the corresponding statistical features;
    constructing the statistical features into an encoding vector in a preset order.
  3. The voice room quality assessment method according to claim 2, wherein counting the number of nouns in the spoken text along multiple preset dimensions to form the corresponding statistical features comprises the following steps:
    extracting the nouns of the spoken text to obtain a noun set;
    filtering the noun set with a preset stopword table to obtain a valid noun set;
    determining, according to the matching rules provided for the different preset dimensions, the number of noun hits of the valid noun set against a preset base noun table under each matching rule, as the statistical feature of the corresponding dimension.
  4. The voice room quality assessment method according to claim 3, wherein extracting the nouns of the spoken text comprises the following steps:
    segmenting the spoken text to obtain a token set;
    encoding the tokens of the token set as embedding vectors;
    extracting deep semantic information from the embedding vectors and performing part-of-speech recognition on it to determine the part of speech of each token;
    extracting the tokens whose part of speech is noun and constructing them into the noun set.
  5. The voice room quality assessment method according to claim 3, wherein determining, according to the matching rules provided for the different preset dimensions, the number of noun hits of the valid noun set against the preset base noun table under each matching rule, as the statistical feature of the corresponding dimension, comprises the following steps:
    counting, according to an exact matching rule, the noun hits of valid nouns in the valid noun set exactly hitting base nouns in the base noun table, as the statistical feature of a comprehensive dimension;
    counting separately, under the exact matching rule and according to the preset categories of the base nouns in the base noun table, the noun hits exactly hitting each preset category, as the statistical features of the respective preset-category dimensions;
    counting, according to a fuzzy matching rule, the noun hits of valid nouns in the valid noun set not exactly hitting, but fuzzily hitting, base nouns in the base noun table, as the statistical feature of a similarity dimension.
  6. The voice room quality assessment method according to claim 5, wherein counting, according to the fuzzy matching rule, the noun hits of valid nouns in the valid noun set not exactly hitting, but fuzzily hitting, base nouns in the base noun table, as the statistical feature of the similarity dimension, comprises the following steps:
    taking the valid nouns of the valid noun set not exactly hitting the base noun table to form a redundant subset;
    computing the pairwise semantic similarity between the vector of each valid noun in the redundant subset and the vector of each base noun in the base noun table;
    counting the valid nouns whose highest semantic similarity exceeds a preset threshold, obtaining the number of fuzzy noun hits on the base noun table.
  7. The voice room quality assessment method according to any one of claims 1 to 6, wherein after the step of determining the quality class of the voice room according to the encoding vector, the method comprises the following steps:
    in response to a voice room recommendation request submitted by a terminal device, determining multiple candidate voice rooms and their corresponding base recommendation scores according to a preset recommendation algorithm;
    adjusting the base recommendation score of each candidate voice room according to the preset weight of the quality class determined for it, obtaining a recommendation display score;
    sorting the candidate voice rooms in descending order of recommendation display score to obtain a voice room recommendation list;
    answering the voice room recommendation request by pushing the voice room recommendation list to the terminal device for display.
  8. A voice room quality assessment apparatus, comprising:
    a speech recognition module configured to acquire the voice stream of a voice room within a unit time period and recognize spoken text from the voice stream;
    a text encoding module configured to construct an encoding vector for the spoken text, the encoding vector containing a statistical feature of the number of audio-source objects in the voice stream, a statistical feature of the total number of utterances, and statistical features of the number of valid nouns in the spoken text;
    a quality recognition module configured to determine the quality class of the voice room according to the encoding vector.
  9. A voice room recognition device, comprising a central processing unit and a memory, wherein the central processing unit is configured to invoke and run a computer program stored in the memory to perform the steps of the method according to any one of claims 1 to 7.
  10. A computer-readable storage medium, storing, in the form of computer-readable instructions, a computer program implemented according to the method of any one of claims 1 to 7, the computer program, when invoked and run by a computer, performing the steps included in the corresponding method.
  11. A computer program product, comprising a computer program/instructions which, when executed by a processor, implement the steps of the method according to any one of claims 1 to 7.
PCT/CN2023/087339 2022-04-28 2023-04-10 Voice room quality assessment method and apparatus, device, medium, and product WO2023207566A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210470807.6 2022-04-28
CN202210470807.6A CN114841143A (zh) 2022-04-28 Voice room quality assessment method and apparatus, device, medium, and product

Publications (1)

Publication Number Publication Date
WO2023207566A1 true WO2023207566A1 (zh) 2023-11-02

Family

ID=82567325

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/087339 WO2023207566A1 (zh) 2022-04-28 2023-04-10 语音房质量评估方法及其装置、设备、介质、产品

Country Status (2)

Country Link
CN (1) CN114841143A (zh)
WO (1) WO2023207566A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114841143A (zh) * 2022-04-28 2022-08-02 广州市百果园信息技术有限公司 语音房质量评估方法及其装置、设备、介质、产品

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103679462A (zh) * 2012-08-31 2014-03-26 阿里巴巴集团控股有限公司 Comment data processing method and apparatus, and search method and system
CN107608964A (zh) * 2017-09-13 2018-01-19 上海六界信息技术有限公司 Bullet-screen-based live content screening method, apparatus, device, and storage medium
CN108320101A (zh) * 2018-02-02 2018-07-24 武汉斗鱼网络科技有限公司 Live room operation capability assessment method, apparatus, and terminal device
US20180308487A1 (en) * 2017-04-21 2018-10-25 Go-Vivace Inc. Dialogue System Incorporating Unique Speech to Text Conversion Method for Meaningful Dialogue Response
CN113064994A (zh) * 2021-03-25 2021-07-02 平安银行股份有限公司 Meeting quality assessment method, apparatus, device, and storage medium
CN114841143A (zh) * 2022-04-28 2022-08-02 广州市百果园信息技术有限公司 Voice room quality assessment method and apparatus, device, medium, and product


Also Published As

Publication number Publication date
CN114841143A (zh) 2022-08-02

Similar Documents

Publication Publication Date Title
CN108052583B E-commerce ontology construction method
CN111125334B Pre-training-based search question answering system
WO2019153737A1 Method, apparatus, device, and storage medium for evaluating a comment
CN111325029B Text similarity calculation method based on a deep learning ensemble model
WO2021114841A1 User report generation method and terminal device
CN111177186B Single-sentence intent recognition method, apparatus, and system based on question retrieval
CN111209363B Corpus data processing method, apparatus, server, and storage medium
CN110263854B Live-streaming label determination method, apparatus, and storage medium
CN112581006A Public opinion engine and method for screening public opinion information and monitoring enterprise risk levels
WO2021036439A1 Method and apparatus for replying to petition questions
CN111930792A Data resource labeling method and apparatus, storage medium, and electronic device
CN112926308B Method, apparatus, device, storage medium, and program product for matching body text
CN112395421B Course label generation method, apparatus, computer device, and medium
CN111080055A Hotel scoring method, hotel recommendation method, electronic apparatus, and storage medium
CN111061837A Topic identification method, apparatus, device, and medium
WO2023207566A1 Voice room quality assessment method and apparatus, device, medium, and product
CN111651606B Text processing method, apparatus, and electronic device
TWI828928B Highly scalable multi-label text classification method and apparatus
TWI734085B Dialogue system using intent-detection ensemble learning and method thereof
CN116756347A Semantic information retrieval method based on big data
CN115017271B Method and system for intelligently generating RPA process component blocks
CN111460114A Retrieval method, apparatus, device, and computer-readable storage medium
CN111241288A Emergency perception system for a highly centralized electric power customer service center and construction method therefor
WO2023093116A1 Method, apparatus, terminal, and storage medium for determining enterprise industry-chain nodes
CN116978367A Speech recognition method, apparatus, electronic device, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23794999

Country of ref document: EP

Kind code of ref document: A1