CN114841143A - Voice room quality evaluation method and device, equipment, medium and product thereof - Google Patents


Info

Publication number
CN114841143A
CN114841143A (application CN202210470807.6A)
Authority
CN
China
Prior art keywords
voice
noun
nouns
room
basic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210470807.6A
Other languages
Chinese (zh)
Inventor
李益永
温偲
陈建强
陈德健
项伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Baiguoyuan Information Technology Co Ltd
Original Assignee
Guangzhou Baiguoyuan Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Baiguoyuan Information Technology Co Ltd
Priority to CN202210470807.6A
Publication of CN114841143A
Priority to PCT/CN2023/087339 (WO2023207566A1)
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques specially adapted for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a method for evaluating the quality of a voice room in the technical field of instant messaging, and to a device, equipment, a medium and a product thereof. The method comprises the following steps: acquiring a voice stream generated in a voice room within a unit time period, and recognizing a speaking text from the voice stream; constructing a coding vector of the speaking text, wherein the coding vector comprises a statistical feature for the number of sound source objects in the voice stream, a statistical feature for the total number of utterances, and a statistical feature for the number of effective nouns in the speaking text; and determining the quality category of the voice room according to the coding vector. The method accurately judges the quality category of the voice stream generated by a voice room, improves the accuracy of recommending voice rooms to platform users, activates platform user traffic, and improves platform user retention.

Description

Voice room quality evaluation method and device, equipment, medium and product thereof
Technical Field
The application relates to the technical field of instant messaging, in particular to a voice room quality assessment method and a device, equipment, medium and product thereof.
Background
In network interaction scenarios, users of a live broadcast platform can communicate with each other by voice, which gives rise to live broadcast rooms with instant-messaging properties. In particular, a live broadcast room may be a dedicated voice room in which users pursue purposes such as topic discussion, talent display, information sharing and knowledge education, thereby promoting overall social benefit.
A live broadcast platform usually supports a massive number of voice rooms, and different voice rooms present different quality because the speech content of their speaking users differs. To recommend voice rooms to platform users, the platform can rely on voice room quality evaluation technology to help screen out high-quality voice rooms.
Traditional voice room quality evaluation either feeds audio features into a preset model for recognition, or recognizes the information obtained after converting speech to text; in practice, both approaches evaluate poorly.
In view of the above, voice room quality evaluation technology still has room for improvement, and such improvement plays a fundamental role in upgrading the services of a live broadcast platform.
Disclosure of Invention
An object of the present application is to solve the above-mentioned problems and provide a speech room quality assessment method and a corresponding apparatus, a speech room recognition device, a computer-readable storage medium, and a computer program product.
According to an aspect of the present application, there is provided a speech room quality assessment method, including the steps of:
acquiring a voice stream in a voice room in a unit time period, and identifying a speaking text from the voice stream;
constructing a coding vector of the speaking text, wherein the coding vector comprises the statistical characteristics of the number of sound source objects of the voice stream, the statistical characteristics of the total speaking times and the statistical characteristics of the number of effective nouns in the speaking text;
and determining the quality category of the speech room according to the coding vector.
According to another aspect of the present application, there is provided a speech room quality assessment apparatus including:
the voice recognition module is used for acquiring a voice stream in a voice room in a unit time period and recognizing a speaking text from the voice stream;
the text coding module is used for constructing a coding vector of the speaking text, wherein the coding vector comprises the statistical feature for the number of sound source objects in the voice stream, the statistical feature for the total number of utterances, and the statistical feature for the number of effective nouns in the speaking text;
and the quality identification module is used for determining the quality category of the speech room according to the coding vector.
According to another aspect of the present application, there is provided a speech room recognition apparatus, comprising a central processing unit and a memory, wherein the central processing unit is used for invoking and running a computer program stored in the memory to execute the steps of the speech room quality assessment method described in the present application.
According to another aspect of the present application, there is provided a computer-readable storage medium storing a computer program implemented according to the speech room quality assessment method in the form of computer-readable instructions, which, when called by a computer, performs the steps included in the method.
According to another aspect of the present application, there is provided a computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of the method described in any one of the embodiments of the present application.
Compared with the prior art, the present application recognizes a speaking text from the voice stream generated by a voice room within a unit time period, constructs a coding vector from the statistical feature for the number of sound source objects, the statistical feature for the total number of utterances, and the statistical feature for the number of effective nouns in the speaking text, and then determines the quality category of that segment of voice stream from the deep semantic information of the coding vector. Because the coding vector is built from statistical features of the speaking text rather than from raw audio features or the raw text itself, the number of sound source objects and the total number of utterances together represent the activity of the voice room, while the noun statistics of the speaking text represent its content quality. The coding vector thus constructed is an effective preliminary representation of the voice stream that carries multi-modal information; on this basis, the quality category determined from deep semantic information is more accurate and reliable, and provides scientific, dependable base data for the platform when recommending voice rooms.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below cover only some embodiments of the present application, and those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of a network architecture corresponding to a speech room operating environment applied in the present application;
FIG. 2 is a schematic flow chart illustrating an embodiment of a speech room quality assessment method according to the present application;
FIG. 3 is a flow chart illustrating a process of recognizing spoken text based on a voice stream according to an embodiment of the present application;
FIG. 4 is a flow chart illustrating a process of constructing a code vector according to an embodiment of the present application;
FIG. 5 is a flowchart illustrating a process of obtaining statistical characteristics according to word segmentation of a spoken text according to an embodiment of the present application;
FIG. 6 is a flowchart illustrating a process of performing word segmentation on the spoken text to obtain a set of scored words according to an embodiment of the present application;
FIG. 7 is a flowchart illustrating a process of determining each statistical feature according to the set of valid names of the spoken text in an embodiment of the present application;
FIG. 8 is a flowchart illustrating a process of fuzzy matching the redundant subset of the valid noun set to count the number of noun hits according to an embodiment of the present application;
FIG. 9 is a flowchart illustrating a training process of a neural network classification model for determining a quality class to which a code vector is mapped according to an embodiment of the present application;
FIG. 10 is a flowchart illustrating a process of pushing a recommended voice room list in response to a request for recommended voice room in an embodiment of the present application;
FIG. 11 is an exemplary graphical user interface for presenting a recommended voice room list according to the present application;
fig. 12 is a functional block diagram of the speech room quality evaluation apparatus of the present application;
fig. 13 is a schematic structural diagram of a speech room recognition apparatus used in the present application.
Detailed Description
The term "server" as used herein extends to service clusters. According to network deployment principles understood by those skilled in the art, servers are divided logically; in physical space they may be mutually independent yet callable through interfaces, or integrated into one physical machine or a computer cluster. Those skilled in the art will appreciate this variation, which should not constrain how the network deployment of the present application is implemented.
Those skilled in the art will appreciate that although the various methods of the present application are described based on the same concept so as to be common to one another, they may be performed independently unless otherwise specified. Likewise, every embodiment disclosed herein is proposed from the same inventive concept, so concepts expressed identically, and concepts whose expressions differ only for convenience, should be understood equally.
Unless a mutual-exclusion relationship between related technical features is explicitly stated, the embodiments disclosed herein can be flexibly constructed by cross-combining those features, as long as the combination does not depart from the inventive spirit of the present application and meets the needs of, or remedies deficiencies in, the prior art. Those skilled in the art will appreciate such variations.
Please refer to the network architecture shown in fig. 1, which can be used to deploy the computer program product obtained by implementing the embodiments of the present application to provide a voice room service, and construct an online voice room through the service, so that users in the voice room can perform online interaction. It should be noted that, in the conventional live broadcast room in the network live broadcast, because there exists a voice stream, it can also be regarded as a specific form of the voice room described in the present application.
The application server 81 shown in fig. 1 may be used to support the implementation of the voice rooms, while the media server 82 handles forwarding of each voice room's voice stream. Terminal devices such as the computer 83 and the mobile phone 84 generally serve voice room users as clients, providing a graphical user interface through a front-end page or an application matched with the voice room service so as to enable human-computer interaction.
Referring to fig. 2, a method for evaluating speech room quality according to an aspect of the present application, in an embodiment, includes the following steps:
step S1100, acquiring a voice stream in a voice room in a unit time period, and identifying a speaking text from the voice stream;
in an exemplary application scenario, a voice room service of a live broadcast platform concurrently runs a large number of voice rooms, voice data generated by each voice room is uploaded to the media server in a streaming media format, and the media server pushes corresponding voice streams to terminal devices of receiving users of the corresponding voice rooms, so that support for instant messaging of the voice rooms is achieved. Therefore, the voice stream corresponding to the voice room can be obtained from the media server.
To process the voice stream conveniently, a unit time period is preset, for example 20 or 30 minutes; those skilled in the art can flexibly set a suitable length as long as an appropriate amount of voice content is obtained per period. Each time the voice stream of a voice room is processed, the current moment is traced back by one unit time period and the voice stream generated within that period is taken for processing; that is, every unit time period, the voice stream generated in the preceding unit time period is processed. The voice stream continuously generated by the voice room is thereby recognized in stages.
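The staged, per-period processing described above can be sketched as a small buffering helper. This is only an illustration: the class name, the frame representation and the default period length are assumptions, not part of the application.

```python
from dataclasses import dataclass, field

UNIT_PERIOD_SECONDS = 20 * 60  # e.g. a 20-minute unit time period (configurable)

@dataclass
class StreamChunker:
    """Accumulates (timestamp, frame) pairs and emits one chunk per unit time period."""
    period: int = UNIT_PERIOD_SECONDS
    _buffer: list = field(default_factory=list)
    _period_start: float = None

    def push(self, timestamp, frame):
        """Feed one audio frame; returns the completed chunk when the period rolls over."""
        if self._period_start is None:
            self._period_start = timestamp
        if timestamp - self._period_start >= self.period:
            chunk = self._buffer          # the previous unit time period, ready for ASR
            self._buffer = [frame]        # start accumulating the next period
            self._period_start = timestamp
            return chunk
        self._buffer.append(frame)
        return None
```

Each chunk returned by `push` would then be handed to speech recognition as one unit time period's voice stream.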
Then, any feasible speech recognition technology is adopted to perform speech-to-text recognition on the speech stream corresponding to the unit time period, so that the corresponding speaking text can be obtained. The spoken text generally includes spoken sentences corresponding to each audio source object in the audio stream.
Step S1200, constructing a coding vector of the speaking text, wherein the coding vector comprises the statistical feature for the number of sound source objects in the voice stream, the statistical feature for the total number of utterances, and the statistical feature for the number of effective nouns in the speaking text;
in order to achieve a preliminary representation of the overall quality of the spoken text, a plurality of statistical features of the speech stream corresponding to the unit time period may be used to construct a corresponding encoding vector. The statistical characteristics comprise statistical characteristics characterized by the number of sound source objects in the sound stream, statistical characteristics characterized by the total number of times of speaking in the sound stream, and statistical characteristics characterized by effective nouns in the speaking text.
The number of sound source objects refers to the total number of users who spoke validly in the voice room during the unit time period. It may be obtained from the voice room service, for example by monitoring each user's speaking behavior within the period and counting the corresponding submitted audio data, or by applying any feasible sound source separation technique to the voice stream; those skilled in the art may implement this flexibly according to the principles disclosed herein. Understandably, the larger the number of sound source objects, the larger the speaking-user scale of the voice room.
The total number of utterances refers to the total count of valid utterances in the voice room during the unit time period. Similarly, it can be obtained from the voice room service, for example by monitoring and counting the audio data submitted for each speaking act within the period, or by applying any feasible human voice detection technique to identify the voice segments of the multiple sound sources; those skilled in the art may implement this flexibly according to the principles disclosed herein. Understandably, the greater the total number of utterances, the more active the communication in the voice room.
The statistical feature for the number of effective nouns in the speaking text is obtained by counting those nouns in the speaking text that match nouns confirmed valid in advance. A basic noun table, composed of manually labeled basic nouns, is provided beforehand; each noun in the speaking text is matched against this table in one or more modes, and the one or more resulting counts serve as the corresponding statistical features. Understandably, the larger the number of effective nouns in the speaking text, the more valuable the information it carries.
The statistics for the number of sound source objects, the total number of utterances and the number of effective nouns in the speaking text thus quantify, respectively, the speaking-user scale, the speaking activity and the information value of the voice room's speech. Constructed into a coding vector of the speaking text, they form a preliminary representation of its quality information.
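Assembling the three groups of statistics into a single coding vector can be sketched as follows. The function name and the choice of five noun-count dimensions are hypothetical, picked only so the resulting vector has seven components, matching the seven-feature example given later in the decision tree discussion.

```python
def build_coding_vector(num_sources, total_utterances, noun_counts):
    """Concatenate the three statistic groups into one flat coding vector.

    num_sources      -- number of distinct speakers in the unit time period
    total_utterances -- total number of valid utterances in the period
    noun_counts      -- per-dimension effective-noun counts (one per matching mode)
    """
    return [float(num_sources), float(total_utterances)] + [float(c) for c in noun_counts]

# A voice room with 4 speakers, 25 utterances, and noun counts along 5 matching dimensions
vec = build_coding_vector(4, 25, [12, 7, 3, 0, 1])
```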
And step S1300, determining the quality category of the speech room according to the coding vector.
In the present application, a quality classification space is constructed in advance, and a plurality of quality categories are contained in the quality classification space, and the number of the categories can be set according to needs, for example, the categories represent three categories of "high, medium and low", or the categories represent four categories of "wonderful, high-quality, common and low", and the like, which can be set by those skilled in the art. On this basis, determining the quality class of its mapping from the code vector can be achieved in a number of ways.
In one mode, a quantity mapping relationship from each statistical feature to each quality category in the coding vector may be constructed based on a mathematical model, for example, each statistical feature is subjected to weighted normalization to obtain a sum value, the sum value is matched with a threshold interval preset for each quality category, and the quality category of which the threshold interval is matched with the sum value is determined as the quality category mapped by the coding vector, that is, the quality category corresponding to the voice stream of the unit time period. The method is simple in calculation, small in calculation amount, beneficial to saving system overhead and capable of improving response speed.
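A minimal sketch of the weighted-normalization-and-threshold mode just described, assuming each statistic has already been normalized into [0, 1]. The weights and class intervals below are invented for illustration, not values from the application.

```python
def classify_by_threshold(vector, weights, intervals):
    """Match the weighted sum of normalized statistics against per-class threshold intervals.

    vector    -- coding vector with each statistic already normalized into [0, 1]
    weights   -- one weight per statistic, summing to 1
    intervals -- {label: (low, high)} half-open threshold intervals [low, high)
    """
    score = sum(w * v for w, v in zip(weights, vector))
    for label, (low, high) in intervals.items():
        if low <= score < high:
            return label
    return None

# Illustrative three-class setup ("high, medium, low" quality categories)
INTERVALS = {"low": (0.0, 0.4), "medium": (0.4, 0.7), "high": (0.7, 1.01)}
WEIGHTS = [0.3, 0.3, 0.4]  # sources, utterances, effective nouns
```

Because this is a single weighted sum and a table lookup, it matches the text's claim of small computation and fast response.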
In another mode, a decision tree algorithm may be applied based on the conventional machine learning principle, and a mathematical model is established by using any optimization algorithm such as ID3, CART, GBDT, XGB, and the like to solve according to the encoding vector, so as to obtain the mapped quality class, which is specifically exemplified as follows:
Let X = (x_1, x_2, …, x_7) denote a coding vector of seven statistical features, and let X_{ij} be the feature vector of the i-th voice room in its j-th 20-minute period, with y_{ij} the corresponding label: y_{ij} = 2 denotes that the voice room belongs to the high quality category in the j-th 20 minutes of the i-th room, y_{ij} = 1 the common quality category, and y_{ij} = 0 the low quality category. Randomly ordering the X_{ij} yields a training set V = (Z_1, Z_2, …, Z_m), where m is the number of samples and Q_i is the label corresponding to Z_i.
It can be known that the optimized mathematical model established by the decision tree algorithm is as follows:
$$\min_{\{f_k\}} \; \sum_{i=1}^{m} l\big(Q_i, \hat{Q}_i\big) + \sum_{k} \Omega(f_k), \qquad \hat{Q}_i = \sum_{k} f_k(Z_i)$$

$$\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2$$
the solution is performed by using the XGB algorithm, but other known algorithms may be used, and those skilled in the art may select the algorithm flexibly. It can be seen that, since each statistical feature in the coding vector is generated based on a numerical value, the method for solving the quality category of the speech room is convenient to analyze efficiently and quickly, and the overall implementation cost is saved.
By way of example, ID3 is a decision tree algorithm whose core principle is to select the partitioning feature by information gain and then recursively construct the decision tree.
By way of example, the CART algorithm's full English name is Classification And Regression Trees; as the name implies, CART can be used for both classification and regression.
By way of example, the GBDT algorithm, Gradient Boosting Decision Tree in full, is an ensemble algorithm based on decision trees. Gradient boosting belongs to the boosting family of ensemble methods and iteratively fits a new learner by gradient descent.
By way of example, the XGB algorithm, also called XGBoost, is an ensemble learning method that uses CART as its base classifier; thanks to its excellent computational efficiency and prediction accuracy, it is widely applied in data-modeling competitions.
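The information-gain criterion at the core of ID3, mentioned above, can be sketched in a few lines. This is a generic illustration of the criterion over discrete feature values, not the application's trained model.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(samples, labels, feature_index):
    """ID3's split criterion: drop in label entropy after partitioning on one feature."""
    by_value = {}
    for sample, label in zip(samples, labels):
        by_value.setdefault(sample[feature_index], []).append(label)
    remainder = sum(len(subset) / len(labels) * entropy(subset)
                    for subset in by_value.values())
    return entropy(labels) - remainder
```

ID3 picks the feature with the largest gain, splits, and recurses on each partition.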
In another mode, based on deep learning, a neural network model serves as the base model to extract deep semantic information from the coding vector, and a classifier maps that information into the quality classification space; the quality category with the maximum classification probability among the categories of the space is taken as the quality category of the voice stream for the unit time period. Naturally, the neural network model should first be trained to a converged state with a sufficient number of training samples. The base model may be a CNN (Convolutional Neural Network), an RNN (Recurrent Neural Network) or the like, and the classifier may be built on the Softmax() function; those skilled in the art can choose flexibly according to the principles disclosed herein. Evidently, this mode takes the semantic relevance among the statistical features into account and is suited to providing large-scale service.
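The final classification stage of this mode, mapping an extracted representation to quality-class probabilities via Softmax, can be sketched as a single linear layer. The toy weights and labels below are stand-ins; a real CNN/RNN would produce the logits from learned parameters.

```python
import math

def softmax(logits):
    """Numerically stable softmax: shift by the max logit before exponentiating."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_quality(coding_vector, weight_matrix, biases, labels):
    """One linear layer plus softmax; returns the arg-max quality label and all probabilities."""
    logits = [sum(w * x for w, x in zip(row, coding_vector)) + b
              for row, b in zip(weight_matrix, biases)]
    probs = softmax(logits)
    return labels[probs.index(max(probs))], probs
```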
As can be seen from the variety of mathematical models available for solving the quality category of a coding vector, the coding vector is built from numerical information, which supplies effective data for mathematical modeling, makes modeling convenient and fast, promotes model convergence, saves solving cost and improves the efficiency of determining a voice room's quality category.
According to the embodiments disclosed herein, and in contrast to the prior art, the present application recognizes a speaking text from the voice stream generated by a voice room within a unit time period, constructs the statistical features for the number of sound source objects, the total number of utterances and the number of effective nouns in the speaking text into a coding vector, and determines the quality category of the voice stream from the coding vector's deep semantic information. Because the coding vector is built from statistical features of the speaking text rather than raw audio features or raw text, the number of sound source objects and the total number of utterances represent the voice room's activity while the noun statistics represent its content quality; the resulting coding vector is an effective preliminary representation of the voice stream carrying multi-modal information, on which basis the determined quality category is more accurate and reliable and supplies scientific, dependable base data for the platform's voice room recommendations.
Referring to fig. 3, according to an embodiment of the present application, the step S1100 of obtaining a voice stream in a voice room in a unit time period and recognizing a speaking text from the voice stream includes the following steps:
step S1110, acquiring a voice stream of a unit time period generated by a voice room in real time;
the voice room can be collected in real time, the voice flow generated immediately can be collected in real time, the corresponding length of the unit time period is taken as a time unit, and real-time analysis is started on the voice flow in the unit time period, so that the speed of judging the quality category of the voice room is further improved, and the real-time quality information of the voice room is reflected more quickly.
Step S1120, performing voice detection on the voice stream, and determining voice segments of different sound source objects;
and detecting Voice Activity of the audio data in the Voice stream by adopting a Voice Activity Detection (VAD) statistical model so as to remove mute information in the Voice Activity, and determining the audio data with the VAD threshold value exceeding a preset threshold value as a Voice segment, thereby obtaining the Voice segment corresponding to each speech. Since the voice room service usually performs sound source separation on the voice stream in advance, or the sound source separation algorithm may be adopted by the application to realize sound source separation, the voice segments may be determined according to different sound source objects.
Step S1130, voice recognition is carried out on the voice segments, and speaking texts corresponding to the voice segments are obtained.
Then, for each voice segment, any feasible speech recognition model based on Automatic Speech Recognition (ASR) technology, such as the Wenet model, is used to recognize the segment and convert it into text, yielding the speaking text corresponding to each voice segment.
The embodiment can rapidly acquire the corresponding speaking text by performing real-time voice analysis on the voice stream generated in the voice room, filters most invalid information in the voice stream, greatly reduces the influence of environmental noise on the quality judgment of the voice room, and enables the quality classification process of the voice room to be faster.
Referring to fig. 4, according to an alternative embodiment of the present application, the step S1200 of constructing the encoding vector of the spoken text includes the following steps:
step S1210, obtaining the number of sound source objects in the voice stream of the unit time period to form corresponding statistical characteristics;
the number of the sound source objects can be predetermined by the voice room service, and can be directly obtained through interface calling, or can be determined through real-time analysis of the voice stream in the unit time period by adopting any feasible sound source separation technology, and in any case, the number of the speaking users in the voice stream generated in the unit time period is determined, so that the corresponding number of the sound source objects is also determined, and the number is taken as one of the statistical characteristics, and the total scale of the speaking users in the voice room can be represented.
Step S1220, acquiring the total speaking times in the voice stream of the unit time period to form corresponding statistical characteristics;
As for the total number of speeches in the voice stream of the unit time period: in one mode, when the voice room service stores user behavior data corresponding to each speech of each user in the voice room, the total number of speeches may be obtained by statistics over the user behavior data; in another mode, combined with the VAD-based voice segment detection of the embodiments of the present application, the total number of voice segments may be directly taken as the total number of speeches. Either way, the total number of speeches in the voice stream of the unit time period is determined and used as one of the statistical features to characterize how actively users speak in the voice room during the unit time period.
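Assuming each detected voice segment has been tagged with its sound source object, the two statistics above reduce to simple aggregation; the segment representation here is an illustrative assumption:

```python
def speech_statistics(segments):
    """Each segment is a (source_id, spoken_text) pair; returns
    (number of sound source objects, total number of speeches)."""
    sources = {src for src, _ in segments}   # distinct speaking users
    return len(sources), len(segments)       # each segment counts as one speech

segments = [
    ("user_a", "hello everyone"),
    ("user_b", "looking for a shirt"),
    ("user_a", "check my store"),
]
print(speech_statistics(segments))  # -> (2, 3)
```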
Step S1230, counting the number of effective nouns in the speaking text according to a plurality of preset dimensions to form corresponding statistical characteristics;
any multiple dimensions can be set, the number of effective nouns in the spoken text can be examined from different modes or granularities respectively, and the number under each dimension is taken as a corresponding statistical characteristic, so that the information value in the spoken text can be represented from different modes or different granularities.
For example, in one mode, a given basic noun table may be consulted, the basic noun table containing manually pre-labeled nouns as basic nouns; then, for each noun in the spoken text, a matching basic noun is sought in the basic noun table according to different matching modes, and whenever a matching basic noun is found, the effective noun count of that matching mode is incremented by 1 unit, each matching mode corresponding to one dimension, so that the corresponding number of effective nouns in different dimensions is determined.
In another mode, on the basis of the basic noun table, finer-grained labeling may be performed on the basic nouns therein, a preset classification is correspondingly set for each basic noun according to a preset classification standard, and then the number of effective nouns hit by each preset classification in the noun in the spoken text is counted as a statistical feature of the corresponding subdivision granularity.
The classification standard may, for example, classify nouns according to their information value and the recommendation purpose they serve. For instance, in a classification standard established to serve commodity recommendation, the corresponding preset classifications may be set as "common nouns", "associated nouns" and "commodity nouns", where the common nouns may correspond to everyday nouns such as "life", "poetry" and "distant"; the associated nouns may correspond to nouns related to the user's shopping needs, such as "subscription", "credit card" and "store"; and the commodity nouns may correspond to specific product names such as "shirt", "mobile phone" and "computer". Thus, based on different service purposes, corresponding classification standards can be formulated to assign classifications to the basic nouns of the basic noun table, thereby providing finer-grained information value labeling for the basic nouns.
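The classification labeling can be sketched as a lookup table keyed by basic noun, using the three example classes named above; the table entries and function name are illustrative assumptions:

```python
# Basic noun table with per-noun preset classifications, using the three
# example classes from the text; the entries themselves are illustrative.
BASIC_NOUN_TABLE = {
    "life": "common", "poetry": "common", "distant": "common",
    "subscription": "associated", "credit card": "associated", "store": "associated",
    "shirt": "commodity", "mobile phone": "commodity", "computer": "commodity",
}

def count_by_class(nouns, table=BASIC_NOUN_TABLE):
    """Count exact hits per preset classification."""
    counts = {"common": 0, "associated": 0, "commodity": 0}
    for noun in nouns:
        cls = table.get(noun)
        if cls is not None:
            counts[cls] += 1
    return counts

print(count_by_class(["shirt", "store", "weather", "computer"]))
# -> {'common': 0, 'associated': 1, 'commodity': 2}
```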
In alternative ways, the first two ways may be combined flexibly as desired, and may be selected flexibly by those skilled in the art based on the principles disclosed herein.
It is easy to understand that in the basic noun table used for determining the effective nouns in the spoken text, each basic noun is labeled in advance so as to be endowed with information value, and especially when the basic nouns in the basic noun table are further labeled with preset classifications, classification value is also incorporated, so that the obtained statistical features under each dimension can effectively characterize the information value of the voice stream of the unit time period from the perspective of different information values.
Step S1240, constructing the statistical features into an encoding vector according to a preset sequence.
Finally, after obtaining a plurality of statistical features, each statistical feature can be constructed into a coding vector according to a certain preset sequence, and the preset sequence can be determined according to the reference of a mathematical model for solving the quality category mapped by the coding vector.
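Concatenating the statistics in a fixed preset order might look as follows; the feature names and their order are assumptions for illustration, since only consistency between training and inference matters:

```python
def build_encoding_vector(stats):
    """Concatenate statistical features in a fixed preset order so the
    downstream model always sees the same layout; missing features
    default to 0.  The key names are illustrative assumptions."""
    order = ["n_sources", "n_speeches", "hits_exact",
             "hits_common", "hits_associated", "hits_commodity", "hits_fuzzy"]
    return [float(stats.get(k, 0)) for k in order]

vec = build_encoding_vector({"n_sources": 2, "n_speeches": 3, "hits_exact": 4})
print(vec)  # -> [2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0]
```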
The embodiment exemplarily discloses a construction process of a coding vector, and accordingly, constructing a coding vector is also a process of preliminarily representing the information value of a voice stream in a unit time period in a speech room, and effectively represents the information value of the voice stream by using a plurality of numerical statistical features, so that the coding vector has a technical basis for solving the corresponding quality category, and provides important basic information for guiding a mathematical model to accurately solve the quality category of the speech room.
Referring to fig. 5, in an embodiment of the present application, the step S1230 of counting the number of effective nouns in the spoken text according to a plurality of preset dimensions to form corresponding statistical features includes the following steps:
s1231, extracting nouns in the speaking text to obtain a noun set;
In the full spoken text corresponding to the voice stream of the unit time period obtained by speech recognition and text conversion, some expressions with little information value may exist. Considering that nouns carry most of the value in language expression, the necessary natural language processing is performed on the spoken text to obtain the nouns therein, and these nouns are constructed into a noun set.
S1232, filtering the noun set according to a preset stop word list to obtain an effective noun set;
In order to ensure the validity of the nouns in the noun set, text preprocessing may be performed on it: for example, a preset stop word list is consulted, and preset stop words such as "the", "is", "which", "who" and "o" are removed from the noun set to purify it, the effective noun set being obtained after purification.
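This filtering step is a straightforward set lookup; the stop-word entries are the examples from the text and the function name is an assumption:

```python
STOP_WORDS = {"the", "is", "which", "who", "o"}  # example stop words from the text

def filter_stop_words(nouns, stop_words=STOP_WORDS):
    """Remove stop words, yielding the effective noun set (order preserved)."""
    return [n for n in nouns if n.lower() not in stop_words]

print(filter_stop_words(["the", "shirt", "who", "store"]))  # -> ['shirt', 'store']
```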
Step S1233, determining the number of hits of the valid noun set in the preset basic noun table under each matching rule according to the matching rules provided correspondingly in the preset different dimensions, as the statistical characteristics of the corresponding dimensions.
Referring to the previous embodiment, on the basis of the valid noun set, the corresponding matching rules may be determined according to different preset dimensions, then, according to the matching rules, each noun in the valid noun set is matched with a basic noun in the basic noun table, and the valid nouns in which matching is achieved are counted to determine the hit number of the corresponding nouns, which is used as the statistical feature of the corresponding dimensions.
In the embodiment, the nouns in the spoken text are extracted to construct the noun set, then stop word filtering is performed, and then the statistical characteristics corresponding to the spoken text required by the coding vector are constructed according to the filtered effective noun set, so that the accuracy and effectiveness of representing the information value by each statistical characteristic are improved, and the coding vector can better guide a mathematical model to perform speech room quality type judgment.
Referring to fig. 6, according to an embodiment of the present application, the step S1231 of extracting nouns in the spoken text includes the following steps:
step S2311, segmenting the speaking text to obtain a segmented word set;
the segmentation of the spoken text can be realized by various statistical-based segmentation algorithms, for example, by using an N-Gram algorithm, and binary or ternary segmentation of the spoken text is performed to obtain a corresponding segmentation set.
Step S2312, encoding the participles in the participle set into embedded vectors;
in order to facilitate semantic extraction on the participle set to determine the part-of-speech of each participle, each participle in the participle set may be encoded using any feasible vector encoding model, such as Word2Vec, and converted into a corresponding embedded vector.
Step S2313, deep semantic information is extracted from the embedded vector, part-of-speech recognition is carried out according to the deep semantic information, and parts-of-speech corresponding to each participle is determined;
Then, part-of-speech recognition may be performed on each participle of the participle set on the basis of its embedded vector. For semantic recognition, any feasible neural network model based on deep learning may be used, for example any model implemented with architectures such as LSTM + CRF or Bert + CRF: the LSTM or Bert base model performs representation learning on the embedded vectors to obtain the corresponding deep semantic information, and the CRF (conditional random field) then performs part-of-speech recognition on them, so that the part of speech corresponding to each participle can be determined. The parts of speech are set according to grammatical word classes, for example: nouns, adjectives, adverbs, pronouns, and the like.
Step S2314, extracting participles with parts of speech as nouns and constructing the participles into the noun set.
In order to construct the noun set, the participles belonging to the noun in the participle set are extracted and constructed into the noun set.
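The final extraction step only filters on the predicted tags. In the sketch below a toy part-of-speech lexicon stands in for the LSTM/BERT + CRF tagger described above; the lexicon, its entries, and the function name are all hypothetical:

```python
# A toy POS lookup stands in for a trained LSTM/BERT + CRF tagger;
# in practice the tags would come from that model's predictions.
TOY_POS = {"shirt": "noun", "buy": "verb", "cheap": "adjective",
           "store": "noun", "quickly": "adverb"}

def extract_nouns(tokens, pos_lookup=TOY_POS):
    """Keep only the participles whose predicted part of speech is 'noun'."""
    return [t for t in tokens if pos_lookup.get(t) == "noun"]

print(extract_nouns(["buy", "cheap", "shirt", "quickly", "store"]))
# -> ['shirt', 'store']
```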
According to the embodiment, the finally obtained noun set has the effect of accurately representing the value of the information content of the speech room through the links of word segmentation, coding, part-of-speech recognition, keyword extraction and the like for the speaking text corresponding to the speech stream in the unit time period, the coding vector is determined on the basis, and a very solid data mining foundation is laid for guiding the mathematical model to solve the quality category of the speech room.
Referring to fig. 7, according to an alternative embodiment of the present application, in step S1233, determining, according to matching rules correspondingly provided in different preset dimensions, a number of nouns hit in a preset basic noun table by a valid noun set under each matching rule, as a statistical feature of the corresponding dimension, includes the following steps:
step S2331, according to the accurate matching rule, counting the number of hit nouns corresponding to the basic nouns in the basic noun table, which are accurately hit by the effective nouns in the effective noun set, and taking the number as the statistical feature of the comprehensive dimensionality;
The effective noun set obtained according to the embodiments of the present application serves as basic data for constructing the statistical features of the spoken text under each preset dimension, and different dimensions may be adapted to different matching rules. Accordingly, in this step, based on the accurate matching rule, each effective noun in the effective noun set is matched against the basic nouns in the basic noun table, so that the number of effective nouns hitting the basic noun table is determined and taken as the statistical feature under the accurate matching rule, i.e. the statistical feature determined from the comprehensive dimension.
When the accurate matching rule is applied, each effective noun to be matched is compared with the basic nouns in the basic noun table, and when the character string of the effective noun is identical to that of a basic noun, a match is confirmed and the corresponding noun hit number is incremented by 1 unit. Since the basic noun table is pre-labeled so as to carry information value as described above, the more effective nouns match the basic noun table in the comprehensive dimension, the higher the comprehensive information value of the effective noun set.
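Exact string matching against the basic noun table reduces to a set-membership count; the function name is an illustrative assumption:

```python
def exact_hit_count(valid_nouns, basic_nouns):
    """Comprehensive-dimension feature: how many effective nouns exactly
    match (string equality) an entry of the basic noun table."""
    table = set(basic_nouns)  # set lookup keeps matching O(1) per noun
    return sum(1 for noun in valid_nouns if noun in table)

print(exact_hit_count(["shirt", "weather", "store"], {"shirt", "store", "computer"}))
# -> 2
```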
Step S2332, according to the preset classification of the basic nouns in the basic noun list, under the accurate matching rule, the hit number of the nouns corresponding to each preset classification is accurately hit and counted in a subdivision mode to serve as the statistical feature corresponding to each preset classification dimension;
According to the disclosure of the foregoing embodiments of the present application, each basic noun in the basic noun table can be classified according to a certain classification standard, thereby giving the basic noun a finer-grained classification information value. Thus, on the basis of the accurate matching rule, the basic nouns hit by the effective nouns of the effective noun set are grouped and summarized according to their preset classifications, so that the number of effective nouns hitting each preset classification is obtained and can be used as the statistical feature corresponding to each preset classification dimension.
Because the preset classification comprises the indication function of the subdivision granularity, the statistical characteristics determined under the dimensionality of each preset classification effectively characterize the abundance of the information value of each preset classification.
Step S2333, according to fuzzy matching rules, counting noun hit numbers of effective nouns in the effective noun set which are not hit accurately and are hit in a fuzzy manner to basic nouns in the basic noun table, and taking the number as a statistical feature of similar dimensions;
Then, for the part of the effective nouns in the effective noun set that did not accurately hit the basic noun table under the accurate matching rule, a fuzzy matching rule may further be applied: these effective nouns are matched against the basic nouns in the basic noun table again, the basic nouns matched from the basic noun table are taken as synonyms of those effective nouns, and the total number of such synonyms, namely the noun hit number determined in the similar dimension, is counted and used as the corresponding statistical feature.
The fuzzy matching rule may perform wildcard matching with a traditional fuzzy rule matching algorithm, or perform semantic matching with a deep-learning-based neural network model, and may be flexibly set by a person skilled in the art. It will be understood that, of the effective nouns that did not accurately hit the basic noun table, only a part may achieve fuzzy matching with it; in any case, the finally determined number of synonyms, i.e. the noun hit number determined by fuzzy matching, characterizes the information value of part of the effective nouns contained in the effective noun set from the perspective of noun closeness, thereby achieving effective characterization of the information value in the form of a corresponding statistical feature.
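The patent leaves the fuzzy rule open; as one traditional stand-in, string similarity via Python's `difflib.SequenceMatcher` can play the role of the fuzzy matcher (the cutoff value and function name are assumptions):

```python
from difflib import SequenceMatcher

def fuzzy_hit_count(missed_nouns, basic_nouns, cutoff=0.8):
    """Similar-dimension feature: count nouns that missed exact matching
    but resemble some basic noun above a similarity cutoff."""
    hits = 0
    for noun in missed_nouns:
        best = max(
            (SequenceMatcher(None, noun, basic).ratio() for basic in basic_nouns),
            default=0.0,
        )
        if best >= cutoff:
            hits += 1  # treat the best-matching basic noun as a synonym
    return hits

# "shirts" fuzzily matches "shirt"; "weather" matches nothing.
print(fuzzy_hit_count(["shirts", "weather"], ["shirt", "store"]))  # -> 1
```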
From the embodiments disclosed herein it can be seen that, when the corresponding statistical features are determined based on the effective noun set of the spoken text, not only the case where effective nouns accurately hit the basic noun table is considered, but also the case where they fuzzily hit it; and not only the comprehensive case of accurately hitting the basic noun table, but also the specific case of accurately hitting each preset classification within it. Statistical features are thereby extracted from different dimensions and different aspects, and since these statistical features correspond to the effective nouns of the spoken text, they can represent the corresponding information value, so that the subsequently obtained encoding vector can more accurately represent effective information about the quality class of the voice room.
Referring to fig. 8, according to an alternative embodiment of the present application, the step S2333 of counting nouns hit number of valid nouns in the valid noun set that are not hit precisely and hit fuzzily on the basic nouns in the basic noun table according to the fuzzy matching rule, as the statistical feature of the similarity dimension, includes the following steps:
step S3331, obtaining the effective nouns in the effective noun set which do not accurately hit the basic noun table to form a redundant subset;
Referring to the previous embodiment, after the effective noun set is matched with the basic noun table by the accurate matching rule, the part of effective nouns not accurately matched with the basic noun table can be determined, and this part of effective nouns may additionally be constructed as a redundant subset of the effective noun set to facilitate subsequent operations.
Step S3332, calculating semantic similarity between the vector of each effective noun in the redundant subset and the vector of each basic noun in the basic noun table;
In this embodiment, a text feature extraction model pre-trained to a convergent state is used to perform representation learning on each effective noun in the redundant subset and each basic noun in the basic noun table, so as to obtain corresponding vectors, wherein the vectors represent the deep semantic information of the nouns. The text feature extraction model is implemented with a neural network model; for example, any basic network model suitable for extracting text features, such as FastText or ALBERT, may be adopted. A person skilled in the art may attach a classifier and perform fine-tuning training on the effective nouns and the basic nouns as required, so that the model accurately learns the vectors corresponding to the deep semantic information of the effective nouns and the basic nouns.
Step S3333, counting the effective nouns whose highest semantic similarity exceeds a preset threshold, so as to obtain the noun hit number of fuzzy hits on the basic noun table.
Then, based on the vector of each effective noun in the redundant subset, the semantic similarity between that vector and the vector of each basic noun in the basic noun table is calculated, thereby obtaining a similarity matrix, in which the value stored in each element represents the semantic similarity between the effective noun corresponding to the element's row and the basic noun corresponding to its column. Representing the semantic similarities in matrix form facilitates fast computation.
The semantic similarity between any two vectors may be calculated by any feasible data distance algorithm, including but not limited to the cosine similarity algorithm, the Euclidean distance algorithm, the Pearson correlation coefficient algorithm, and the Jaccard coefficient algorithm. After calculation, the results are suitably normalized so that a larger value indicates that the two vectors are more similar, and the corresponding semantic similarity value is obtained and stored in the similarity matrix.
In the similarity matrix, for each effective noun, the semantic similarity to each basic noun can be used to determine whether the effective noun matches one of the basic nouns. Specifically, a preset threshold may be provided as the measure of whether the similarity satisfies the matching condition; then, for the basic noun corresponding to the element with the highest semantic similarity, that similarity value is compared with the preset threshold. When the similarity value exceeds the threshold, it may be determined that the two vectors form a match, that is, the effective noun matches the basic noun, whereupon the noun hit number in the similar dimension is incremented by 1 unit; when it does not exceed the threshold, it may be determined that the two vectors do not form a match. Whether each effective noun fuzzily matches the basic noun table is determined by this principle, and the noun hit number finally obtained after traversing the full set of effective nouns of the similarity matrix is the statistical feature under the similar dimension.
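Taking cosine similarity as the distance measure (one of the options listed above), the per-noun best-match-and-threshold logic can be sketched as follows; the toy 2-D "embeddings" and the threshold value are assumptions:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def semantic_hit_count(noun_vecs, basic_vecs, threshold=0.9):
    """For each effective-noun vector, take the highest cosine similarity
    against the basic-noun vectors; count a fuzzy hit when that maximum
    exceeds the preset threshold."""
    hits = 0
    for u in noun_vecs:
        best = max(cosine(u, v) for v in basic_vecs)  # row maximum of the matrix
        if best > threshold:
            hits += 1
    return hits

# Toy 2-D embeddings: first noun nearly parallel to a basic noun, second not.
nouns = [[1.0, 0.05], [0.0, 1.0]]
basics = [[1.0, 0.0], [0.7, 0.7]]
print(semantic_hit_count(nouns, basics))  # -> 1
```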
According to the embodiment disclosed herein, when the statistical characteristics of the spoken text in the similar dimension are determined, part of valid nouns not accurately hit in the basic noun table are subjected to fuzzy matching with the basic nouns in the basic noun table based on semantic similarity, so that the number of corresponding synonyms, that is, the number of hit nouns in the similar dimension is determined as the corresponding statistical characteristics, and accordingly, deeper data mining of the value of the information of the nouns in the valid noun set is realized by means of semantic similarity, missing of important information is avoided, and the value of the synonyms can be more scientifically and sufficiently represented by the corresponding statistical characteristics, so that the subsequent speech room category determination can be guided to obtain more accurate determination results.
Referring to fig. 9, according to an alternative embodiment of the present application, the determining the quality class of the coding vector may be implemented by using a neural network model based on deep learning, for this purpose, the step S1300 of determining the quality class of the speech room according to the coding vector is implemented by using a neural network classification model trained in advance to a convergence state, and a training process of the neural network classification model includes the following steps:
step S4100, calling a single training sample in a preset data set, wherein the training sample comprises a voice stream in a unit time period and a quality category labeled for the voice stream;
for example, the neural network classification model may employ a general convolutional neural network for performing representation learning on the input editing vector, and combine with a classifier for mapping the representation learning result to a preset quality classification space. Accordingly, a data set is prepared for training the neural network classification model to converge.
The data set can be sampled from a voice stream generated by a voice room of a live broadcast platform by a person skilled in the art according to the manner disclosed in the embodiments of the present application, and after manually labeling the corresponding quality category, a training sample in the data set is formed. It is understood that, during sampling, the voice streams generated by the same voice room in different unit time periods can be collected to form different training samples, and generally, the information values of the voice streams of the same voice room in different unit time periods are different, so that the quality categories marked correspondingly can be different.
When the neural network classification model undergoes one round of training, any training sample may be drawn from the data set to obtain its voice stream and the quality category labeled for it, the former being used to construct the encoding vector required as input to the classification model and the latter being used to supervise the output of the classification model.
The method for constructing the corresponding coding vector for the voice stream in the training sample can be implemented correspondingly according to the corresponding method of any one of the embodiments disclosed in the present application, and in short, as long as the neural network classification model maintains the consistency of the coding vector construction in the training stage and the reasoning stage, the normal use of the neural network classification model can be determined.
Step S4200, extracting deep semantic information from the coding vector corresponding to the voice stream of the training sample through a convolutional neural network;
as described above, the convolutional neural network in the neural network classification model is used as a basic model and is responsible for representing and learning the coding vector constructed corresponding to the voice stream in the training sample, so as to extract deep semantic information of the coding vector.
Step S4300, classifying and mapping the deep semantic information through a classifier to obtain a predicted quality category;
then, the deep semantic information enters a classifier after being fully connected and is mapped into a quality classification space, so that the classification probability corresponding to each quality class of the deep semantic information mapped into the quality classification space is predicted, and the quality class with the maximum classification probability is taken as the quality class corresponding to the coding vector predicted by the model. The quality classification space is preset for determining the voice quality level of the voice stream as described above, and can be flexibly set by those skilled in the art, which will not be repeated herein.
Step S4400, calculating a model loss value of the predicted quality type according to the labeled quality type;
the quality classes pre-labeled in the training samples are used as supervision labels of model output and used for calculating model loss values corresponding to the quality classes predicted by the models.
Step S4500, judging whether the model loss value reaches a preset threshold value; when it does not, performing a gradient update on the model and calling the next training sample to continue iterative training, and otherwise judging that the model has converged and terminating the training.
In order to decide the iterative training process of the neural network classification model, a preset threshold value is provided for the training of the classification model, then the model loss value generated aiming at the training sample is compared with the preset threshold value, when the model loss value does not reach the preset threshold value, each link of the classification model can be subjected to back propagation according to the model loss value so as to correct the weight of each link, and the gradient updating of the classification model is realized. When the model loss value reaches the preset threshold value, the classification model is trained to be in a convergence state, so that the training of the classification model can be stopped, and the classification model can be put into practical use.
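The predict / loss / gradient-update / threshold-stop loop of steps S4200-S4500 can be sketched in miniature. A single logistic unit stands in for the convolutional network plus classifier, and the sample data, learning rate, and threshold are illustrative assumptions:

```python
import math

def train_until_threshold(samples, loss_threshold=0.1, lr=0.1, max_epochs=5000):
    """Predict, compute the loss, gradient-update while the mean loss is
    above the preset threshold, and stop (judge converged) once it drops
    below.  A logistic unit replaces the CNN + classifier for illustration."""
    dim = len(samples[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(max_epochs):
        total = 0.0
        for x, y in samples:                       # call one training sample
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))         # classification mapping
            total += -(y * math.log(p + 1e-9) + (1 - y) * math.log(1 - p + 1e-9))
            grad = p - y                           # gradient update
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
        if total / len(samples) < loss_threshold:  # threshold reached: converged
            break
    return w, b

# Encoding vectors labeled high (1) / low (0) quality; values are illustrative.
data = [([3.0, 5.0], 1), ([0.0, 1.0], 0), ([4.0, 4.0], 1), ([1.0, 0.0], 0)]
w, b = train_until_threshold(data)
```

A real implementation would use a deep-learning framework with backpropagation through all layers, but the stopping criterion and update cycle mirror the steps above.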
According to the embodiments, after the neural network classification model based on deep learning is trained to a convergence state, the neural network classification model is used for determining the mapped quality category according to the coding vector, and the classification model can deeply understand semantic association information among various statistical features in the coding vector to obtain corresponding deep semantic information for classification mapping, so that the effect of performing deep data mining on the coding vector to obtain effective information value is achieved, and accordingly, an accurate quality category judgment effect can be expected.
Referring to fig. 10, according to an alternative embodiment of the present application, after the step of determining the quality class of the speech room according to the coding vector in the step S1300, the method includes the following steps:
step S5100, responding to a voice room recommendation request submitted by the terminal device, and determining a plurality of candidate voice rooms and corresponding basic recommendation scores according to a preset recommendation algorithm;
in an exemplary application scenario, when a user of a live broadcast platform needs to obtain a corresponding voice room recommendation list through a page as shown in fig. 11 at a terminal device of the user, a corresponding voice room recommendation request may be triggered by entering the page for the first time or refreshing the page, and after receiving the request, a voice room service may invoke a preset recommendation algorithm to determine a plurality of candidate voice rooms for the user, and determine a basic recommendation score corresponding to each candidate voice room according to the recommendation algorithm.
The recommendation algorithm can be flexibly implemented by a person skilled in the art according to preset recommendation service logic as required, for example, tag matching can be performed on massive voice rooms in the platform according to tags of voice rooms accessed in the historical behavior data of the user, personalized candidate voice rooms can be matched for the user, and corresponding basic recommendation scores can be quantized according to matching processes of the tags.
In one implementation, the recommendation algorithm may be implemented with a two-tower model, which takes the vectors of the voice room tags accessed in the user's historical behavior data as one input path and the vectors of the full set of voice room tags on the platform as the other, performs semantic similarity matching after representation learning so as to determine the corresponding semantic similarities, and then selects, according to these semantic similarities, a plurality of voice rooms as the candidate voice rooms, where the semantic similarity corresponding to each candidate voice room can be used as its basic recommendation score.
Step S5200, adjusting corresponding basic recommendation scores according to the preset weight of the quality category correspondingly determined by each candidate voice room to obtain recommendation display scores;
each of the candidate speech rooms may be determined and determined by any one of the embodiments in the foregoing of the present application, and in order to reflect the information value of the quality category, weights for adjusting the recommended presentation scores may be preset for each quality category of the quality classification system, respectively, so that the higher the actually characterized information quality is, the higher the weight thereof is, and the lower the actually characterized information quality is, the lower the weight thereof is. Thereby, a quantitative evaluation of different quality classes is achieved.
For each candidate voice room, the preset weight of its corresponding quality category is multiplied by its basic recommendation score, and the resulting product serves as its recommendation display score. Since the weights are quantified according to the different quality categories, the recommendation display score is essentially the basic recommendation score correspondingly reduced or increased by the weight.
Step S5300, performing reverse sorting on the candidate voice rooms according to the recommendation display scores to obtain a voice room recommendation list;
after each candidate voice room obtains the corresponding recommendation display score, the candidate voice rooms can be reversely sorted according to the recommendation display score, so that the voice room with better quality is sorted in front, and a final voice room recommendation list is obtained according to the reverse sorting result.
Step S5400, responding to the voice room recommendation request, and pushing the voice room recommendation list to the terminal device for display.
At this point, the voice room recommendation list can be pushed to the terminal device that submitted the voice room recommendation request, completing the response. Each entry in the list may package the necessary information of the corresponding voice room, including but not limited to its access portal link, its profile, and the like. After receiving the list, the terminal device parses it and displays it in a graphical user interface, as shown in fig. 11.
This embodiment exemplarily shows how the quality-category recognition capability implemented by the present application serves the voice room recommendation service. With the quality category of each voice room determined accurately and in a timely manner, the platform can prioritize rooms by their information value when recommending voice rooms to its users, effectively attracting users to stay on the platform and steering traffic toward high-quality voice rooms. The recommendation logic of the whole platform is thereby optimized, and favorable economies of scale can be expected.
Referring to fig. 12, an apparatus for evaluating speech room quality according to an aspect of the present application includes a speech recognition module 1100, a text encoding module 1200, and a quality recognition module 1300, wherein: the voice recognition module 1100 is configured to obtain a voice stream in a voice room in a unit time period, and recognize a speaking text from the voice stream; the text encoding module 1200 is configured to construct an encoding vector of the spoken text, where the encoding vector includes statistical characteristics of the number of sound source objects of the speech stream, statistical characteristics of total speaking times, and statistical characteristics of the number of effective nouns in the spoken text; the quality identification module 1300 is configured to determine the quality category of the speech room according to the coding vector.
In an alternative embodiment of the present application, the speech recognition module 1100 includes: the segmentation processing submodule is used for acquiring a voice stream of a unit time period generated by the voice room in real time; the voice detection submodule is used for carrying out voice detection on the voice stream and determining voice segments of different sound source objects; and the recognition conversion submodule is used for carrying out voice recognition on the voice segments to obtain the speaking texts corresponding to the voice segments.
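The voice detection submodule's segmentation can be illustrated, in a much-simplified form, by an energy-threshold voice activity detector over per-frame energies. This is only a sketch of the segmentation idea: a real system would use a proper VAD and additionally attribute segments to different sound source objects, which this toy version does not attempt, and the frame energies below are made up.

```python
def detect_speech_segments(frames, energy_threshold=0.5):
    """Naive energy-based voice activity detection: group consecutive frames
    whose energy exceeds the threshold into (start, end) speech segments."""
    segments, start = [], None
    for i, energy in enumerate(frames):
        if energy > energy_threshold and start is None:
            start = i                     # speech begins
        elif energy <= energy_threshold and start is not None:
            segments.append((start, i))   # speech ends
            start = None
    if start is not None:                 # stream ended mid-speech
        segments.append((start, len(frames)))
    return segments

# Hypothetical per-frame energies for one unit time period.
segments = detect_speech_segments([0.1, 0.9, 0.8, 0.2, 0.7, 0.6])
```

Each returned segment would then be fed to speech recognition to obtain its spoken text.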
In an embodiment of the present application, the text encoding module 1200 includes: the sound source statistic submodule is used for acquiring the number of sound source objects in the voice stream of the unit time period to form corresponding statistic characteristics; the speech statistics submodule is used for acquiring the total speech times in the voice stream of the unit time period to form corresponding statistical characteristics; the noun counting submodule is used for counting the number of effective nouns in the speaking text according to a plurality of preset dimensions to form corresponding counting characteristics; and the coding construction submodule is used for constructing each statistical characteristic into a coding vector according to a preset sequence.
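The coding construction submodule's assembly of the statistical features into one vector in a preset order can be sketched as follows; the dimension names and counts are illustrative assumptions:

```python
def build_coding_vector(num_sources, total_turns, noun_counts, dim_order):
    """Concatenate the statistical features in a fixed, preset order:
    sound-source count, total speaking turns, then the per-dimension
    effective-noun counts in the order given by dim_order."""
    return [num_sources, total_turns] + [noun_counts.get(d, 0) for d in dim_order]

# Assumed preset dimensions: a comprehensive count, two per-class counts,
# and a fuzzy "similar" count.
DIMENSIONS = ["overall", "topic_game", "topic_music", "similar"]
vec = build_coding_vector(3, 17, {"overall": 5, "topic_game": 2, "similar": 1}, DIMENSIONS)
```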
According to an alternative embodiment of the present application, the noun statistic submodule includes: a noun extracting unit, configured to extract nouns in the spoken text to obtain a noun set; the noun filtering unit is used for filtering the noun set according to a preset stop word list to obtain an effective noun set; and the matching statistical unit is used for determining the number of nouns hit in the preset basic noun list by the valid noun set under each matching rule according to the matching rules correspondingly provided by different preset dimensions, and the number is used as the statistical characteristic of the corresponding dimensions.
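The noun filtering unit's stop-word filtering step reduces to a set difference; the stop words and nouns below are illustrative:

```python
def filter_valid_nouns(nouns, stop_words):
    """Drop stop-words from the extracted noun set to obtain the valid noun set."""
    return {n for n in nouns if n not in stop_words}

# Hypothetical extracted nouns and stop-word list.
valid = filter_valid_nouns({"guitar", "thing", "piano", "stuff"}, {"thing", "stuff"})
```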
According to an alternative embodiment of the present application, the noun extraction unit includes: the word segmentation subunit is used for segmenting the speaking text to obtain a word set; the vectorization subunit is used for encoding the participles in the word set into embedded vectors; the part-of-speech recognition subunit is used for extracting deep semantic information from the embedded vectors, performing part-of-speech recognition according to the deep semantic information, and determining the part of speech corresponding to each participle; and the noun extraction subunit is used for extracting the participles whose part of speech is a noun to construct the noun set.
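The extraction pipeline above (segment, tag parts of speech, keep nouns) can be mimicked with a toy part-of-speech lexicon. This replaces the patent's embedded vectors and deep-semantic tagging with a plain dictionary lookup purely for illustration; the lexicon and sentence are made up.

```python
# Toy part-of-speech lexicon standing in for a learned tagger.
POS_LEXICON = {"play": "verb", "the": "det", "guitar": "noun", "piano": "noun"}

def extract_nouns(spoken_text):
    """Tokenize the spoken text, look up a part of speech per token, and
    keep only the tokens tagged as nouns to form the noun set."""
    tokens = spoken_text.lower().split()
    return {t for t in tokens if POS_LEXICON.get(t) == "noun"}

nouns = extract_nouns("play the guitar the piano")
```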
According to an alternative embodiment of the present application, the matching statistic unit includes: the exact statistics secondary unit is used for counting, according to the exact matching rule, the number of noun hits where effective nouns in the effective noun set exactly hit basic nouns in the basic noun table, as the statistical characteristic of the comprehensive dimension; the subdivision statistics secondary unit is used for performing subdivision statistics, under the exact matching rule and according to the preset classifications of the basic nouns in the basic noun table, on the number of noun hits corresponding to each preset classification, as the statistical characteristic of the corresponding preset classification dimension; and the fuzzy statistics secondary unit is used for counting, according to the fuzzy matching rule, the number of noun hits where effective nouns in the effective noun set that are not exactly hit fuzzily hit the basic noun table, as the statistical characteristic of the similarity dimension.
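The exact-match and per-class subdivision counts can be sketched together; the residue (effective nouns with no exact hit) is what the fuzzy matching rule then operates on. The basic noun table and its classes below are illustrative assumptions.

```python
# Assumed basic noun table: each basic noun maps to its preset classification.
BASE_NOUNS = {"guitar": "music", "piano": "music", "league": "game"}

def match_statistics(valid_nouns):
    """Return the comprehensive exact-hit count, the per-class subdivision
    counts, and the residue of unmatched nouns left for fuzzy matching."""
    hits = [n for n in valid_nouns if n in BASE_NOUNS]
    per_class = {}
    for n in hits:
        cls = BASE_NOUNS[n]
        per_class[cls] = per_class.get(cls, 0) + 1
    residue = {n for n in valid_nouns if n not in BASE_NOUNS}
    return len(hits), per_class, residue

total, per_class, residue = match_statistics({"guitar", "piano", "league", "ukulele"})
```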
In an alternative embodiment of the present application, the fuzzy statistics secondary unit includes: a redundancy construction subunit, configured to obtain a redundant subset formed by the effective nouns in the effective noun set that do not exactly hit the basic noun table; a similarity calculation subunit, configured to calculate the semantic similarity between the vector of each effective noun in the redundant subset and the vector of each basic noun in the basic noun table; and a screening counting subunit, configured to count the effective nouns whose highest semantic similarity exceeds a preset threshold, thereby obtaining the number of noun hits fuzzily hitting the basic noun table.
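A minimal sketch of the fuzzy matching rule, assuming cosine similarity as the semantic similarity measure and made-up two-dimensional word vectors; the threshold value is also an assumption:

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def fuzzy_hits(residue_vecs, base_vecs, threshold=0.8):
    """Count residue nouns whose highest similarity to any basic noun
    exceeds the threshold: each one is a fuzzy hit in the 'similar' dimension."""
    count = 0
    for vec in residue_vecs.values():
        best = max(cosine(vec, b) for b in base_vecs.values())
        if best > threshold:
            count += 1
    return count

# Hypothetical embeddings: "ukulele" missed the exact match but lies
# close to "guitar" in vector space.
residue = {"ukulele": [0.9, 0.4]}
base = {"guitar": [1.0, 0.3], "league": [0.0, 1.0]}
hits = fuzzy_hits(residue, base, threshold=0.8)
```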
According to another embodiment of the present application, the quality recognition module 1300 is implemented by a neural network classification model trained in advance to a convergence state, where the neural network classification model is trained by a preset training device, and the training device includes: a sample calling module, configured to call a single training sample in a preset data set, where the training sample includes a voice stream of a unit time period and a quality category labeled for the voice stream; a semantic extraction module, configured to extract deep semantic information from the coding vector corresponding to the voice stream of the training sample through a convolutional neural network; a classification mapping module, configured to perform classification mapping on the deep semantic information through a classifier to obtain a predicted quality category; a loss calculation module, configured to calculate a model loss value of the predicted quality category against the labeled quality category; and an iteration decision module, configured to judge whether the model loss value reaches a preset threshold, perform gradient updating on the model when the threshold is not reached and call the next training sample to continue iterative training, and otherwise judge that the model has converged and terminate training.
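Only the control flow of this training device (predict, compute loss, check the convergence threshold, update, next sample) is mirrored in the toy loop below. The convolutional network and classifier are replaced by a one-weight linear model with squared-error loss purely to keep the sketch self-contained; the learning rate, loss target, and data are arbitrary.

```python
# Toy stand-in for the classifier: one weight, squared-error loss,
# plain gradient descent. Only the iteration-decision logic mirrors
# the training device described above.
w = 0.0
LR, LOSS_TARGET = 0.1, 1e-4
samples = [(1.0, 2.0)] * 200  # repeated (input, label) pairs as "training samples"

for x, y in samples:
    pred = w * x                      # forward pass (classification mapping)
    loss = (pred - y) ** 2            # model loss against the label
    if loss <= LOSS_TARGET:           # iteration decision: converged, stop
        break
    w -= LR * 2 * (pred - y) * x      # gradient update, then next sample
```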
According to an alternative embodiment of the present application, the apparatus further includes: a request response module, configured to respond to a voice room recommendation request submitted by a terminal device and determine a plurality of candidate voice rooms and their corresponding basic recommendation scores according to a preset recommendation algorithm; a score adjustment module, configured to adjust the corresponding basic recommendation score according to the preset weight of the quality category determined for each candidate voice room, so as to obtain a recommendation display score; a sorting processing module, configured to sort the candidate voice rooms in descending order of the recommendation display scores to obtain a voice room recommendation list; and a response pushing module, configured to respond to the voice room recommendation request and push the voice room recommendation list to the terminal device for display.
Another embodiment of the present application further provides a speech room recognition device, which may be implemented by a computer device. Fig. 13 schematically illustrates the internal structure of the computer device, which includes a processor, a computer-readable storage medium, a memory, and a network interface connected by a system bus. The computer-readable storage medium stores an operating system, a database, and computer-readable instructions; the database can store control information sequences, and the computer-readable instructions, when executed by the processor, cause the processor to implement the voice room quality assessment method. The processor provides computing and control capabilities and supports the operation of the whole computer device. The memory may store computer-readable instructions that, when executed by the processor, cause the processor to perform the voice room quality assessment method of the present application. The network interface is used to connect and communicate with the terminal. Those skilled in the art will appreciate that the architecture shown in fig. 13 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In this embodiment, the processor is configured to execute the specific functions of each module in fig. 12, and the memory stores the program codes and various data required for executing those modules or submodules. The network interface is used for data transmission to and from a user terminal or server. The memory in this embodiment stores the program codes and data necessary for executing all modules of the voice room quality assessment apparatus of the present application, and the processor can call them to execute the functions of all modules.
The present application also provides a storage medium storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the speech room quality assessment method of any of the embodiments of the present application.
The present application also provides a computer program product comprising computer programs/instructions which, when executed by one or more processors, implement the steps of the method as described in any of the embodiments of the present application.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments of the present application can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when the computer program is executed, the processes of the embodiments of the methods can be included. The storage medium may be a computer-readable storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a Random Access Memory (RAM).
The foregoing describes only some embodiments of the present application. It should be noted that those skilled in the art can make several modifications and refinements without departing from the principle of the present application, and these modifications and refinements should also be regarded as falling within the protection scope of the present application.
To sum up, the present application can accurately determine the quality category of the voice stream generated by a voice room, improve the accuracy of voice room recommendations for platform users, activate platform user traffic, and increase the retention rate of platform users.

Claims (11)

1. A speech room quality assessment method is characterized by comprising the following steps:
acquiring a voice stream in a voice room in a unit time period, and identifying a speaking text from the voice stream;
constructing a coding vector of the speaking text, wherein the coding vector comprises the statistical characteristics of the number of sound source objects of the voice stream, the statistical characteristics of the total speaking times, and the statistical characteristics of the number of effective nouns in the speaking text;
and determining the quality class of the speech room according to the coding vector.
2. The method for evaluating the quality of the speech room according to claim 1, wherein the constructing the coding vector of the spoken text comprises the following steps:
acquiring the number of sound source objects in the voice stream in the unit time period to form corresponding statistical characteristics;
acquiring the total speaking times in the voice stream of the unit time period to form corresponding statistical characteristics;
counting the number of effective nouns in the speaking text according to a plurality of preset dimensions to form corresponding statistical characteristics;
and constructing the statistical characteristics into a coding vector according to a preset sequence.
3. The method for evaluating the speech room quality according to claim 2, wherein the step of counting the number of effective nouns in the speaking text according to a plurality of preset dimensions to form corresponding statistical characteristics comprises the following steps:
extracting nouns in the speaking text to obtain a noun set;
filtering the noun set according to a preset stop word list to obtain an effective noun set;
and determining the number of nouns hit in the preset basic noun list hit by the valid noun set under each matching rule according to the matching rules correspondingly provided by different preset dimensions, and taking the number as the statistical characteristic of the corresponding dimensions.
4. The method for evaluating speech room quality according to claim 3, wherein said extracting nouns in said spoken text comprises the steps of:
performing word segmentation on the speaking text to obtain a word set;
encoding the participles in the participle set into embedded vectors;
deep semantic information is extracted from the embedded vector, part of speech recognition is carried out according to the deep semantic information, and part of speech corresponding to each participle is determined;
and extracting participles with parts of speech as nouns to construct the noun set.
5. The method for evaluating the speech room quality according to claim 3, wherein the step of determining, according to the matching rules correspondingly provided by different preset dimensions, the number of nouns in the effective noun set that hit the preset basic noun list under each matching rule, as the statistical characteristics of the corresponding dimensions, comprises the following steps:
according to the exact matching rule, counting the number of noun hits where effective nouns in the effective noun set exactly hit basic nouns in the basic noun table, as the statistical characteristic of the comprehensive dimension;
according to the preset classifications of the basic nouns in the basic noun table, performing subdivision statistics, under the exact matching rule, on the number of noun hits corresponding to each preset classification, as the statistical characteristic of the corresponding preset classification dimension;
and according to the fuzzy matching rule, counting the number of noun hits where effective nouns in the effective noun set that are not exactly hit fuzzily hit basic nouns in the basic noun table, as the statistical characteristic of the similarity dimension.
6. The method for evaluating the quality of a speech room according to claim 5, wherein the step of counting, according to the fuzzy matching rule, the number of noun hits where effective nouns in the effective noun set that are not exactly hit fuzzily hit basic nouns in the basic noun table, as the statistical characteristic of the similarity dimension, comprises the following steps:
obtaining effective nouns which do not accurately hit the basic noun table in the effective noun set to form a redundant subset;
calculating semantic similarity between the vector of each effective noun in the redundant subset and the vector of each basic noun in the basic noun table;
and counting the effective nouns whose highest semantic similarity exceeds a preset threshold, thereby obtaining the number of noun hits fuzzily hitting the basic noun table.
7. The method according to any one of claims 1 to 6, wherein the step of determining the quality class of the speech room according to the coding vector is followed by the steps of:
responding to a voice room recommendation request submitted by terminal equipment, and determining a plurality of candidate voice rooms and corresponding basic recommendation scores thereof according to a preset recommendation algorithm;
according to the preset weight of the quality category correspondingly determined by each candidate voice room, adjusting the corresponding basic recommendation score to obtain a recommendation display score;
sorting each candidate voice room in descending order of the recommendation display scores to obtain a voice room recommendation list;
and responding to the voice room recommendation request, and pushing the voice room recommendation list to the terminal equipment for display.
8. A speech room quality assessment apparatus, comprising:
the voice recognition module is used for acquiring a voice stream in a voice room in a unit time period and recognizing a speaking text from the voice stream;
the text coding module is used for constructing a coding vector of the speaking text, wherein the coding vector comprises the statistical characteristics of the number of sound source objects of the voice stream, the statistical characteristics of the total speaking times, and the statistical characteristics of the number of effective nouns in the speaking text;
and the quality identification module is used for determining the quality category of the speech room according to the coding vector.
9. A speech room recognition device comprising a central processor and a memory, characterized in that the central processor is adapted to invoke execution of a computer program stored in the memory to perform the steps of the method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that it stores, in the form of computer-readable instructions, a computer program that, when invoked and executed by a computer, performs the steps of the method according to any one of claims 1 to 7.
11. A computer program product comprising computer programs/instructions which, when executed by a processor, carry out the steps of the method of any one of claims 1 to 7.
CN202210470807.6A 2022-04-28 2022-04-28 Voice room quality evaluation method and device, equipment, medium and product thereof Pending CN114841143A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210470807.6A CN114841143A (en) 2022-04-28 2022-04-28 Voice room quality evaluation method and device, equipment, medium and product thereof
PCT/CN2023/087339 WO2023207566A1 (en) 2022-04-28 2023-04-10 Voice room quality assessment method, apparatus, and device, medium, and product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210470807.6A CN114841143A (en) 2022-04-28 2022-04-28 Voice room quality evaluation method and device, equipment, medium and product thereof

Publications (1)

Publication Number Publication Date
CN114841143A true CN114841143A (en) 2022-08-02

Family

ID=82567325

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210470807.6A Pending CN114841143A (en) 2022-04-28 2022-04-28 Voice room quality evaluation method and device, equipment, medium and product thereof

Country Status (2)

Country Link
CN (1) CN114841143A (en)
WO (1) WO2023207566A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023207566A1 (en) * 2022-04-28 2023-11-02 广州市百果园信息技术有限公司 Voice room quality assessment method, apparatus, and device, medium, and product

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103679462B (en) * 2012-08-31 2019-01-15 阿里巴巴集团控股有限公司 A kind of comment data treating method and apparatus, a kind of searching method and system
US10347244B2 (en) * 2017-04-21 2019-07-09 Go-Vivace Inc. Dialogue system incorporating unique speech to text conversion method for meaningful dialogue response
CN107608964B (en) * 2017-09-13 2021-01-12 上海六界信息技术有限公司 Live broadcast content screening method, device, equipment and storage medium based on barrage
CN108320101A (en) * 2018-02-02 2018-07-24 武汉斗鱼网络科技有限公司 Direct broadcasting room operation ability appraisal procedure, device and terminal device
CN113064994A (en) * 2021-03-25 2021-07-02 平安银行股份有限公司 Conference quality evaluation method, device, equipment and storage medium
CN114841143A (en) * 2022-04-28 2022-08-02 广州市百果园信息技术有限公司 Voice room quality evaluation method and device, equipment, medium and product thereof


Also Published As

Publication number Publication date
WO2023207566A1 (en) 2023-11-02

Similar Documents

Publication Publication Date Title
CN110096570B (en) Intention identification method and device applied to intelligent customer service robot
CN112069298B (en) Man-machine interaction method, device and medium based on semantic web and intention recognition
CN108304372B (en) Entity extraction method and device, computer equipment and storage medium
CN111325029B (en) Text similarity calculation method based on deep learning integrated model
CN107797987B (en) Bi-LSTM-CNN-based mixed corpus named entity identification method
CN109271524B (en) Entity linking method in knowledge base question-answering system
CN113505200B (en) Sentence-level Chinese event detection method combined with document key information
CN111475650B (en) Russian semantic role labeling method, system, device and storage medium
CN111209363B (en) Corpus data processing method, corpus data processing device, server and storage medium
CN112131876A (en) Method and system for determining standard problem based on similarity
CN112036705A (en) Quality inspection result data acquisition method, device and equipment
CN115146629A (en) News text and comment correlation analysis method based on comparative learning
CN112395421B (en) Course label generation method and device, computer equipment and medium
CN111667817A (en) Voice recognition method, device, computer system and readable storage medium
CN111930933A (en) Detection case processing method and device based on artificial intelligence
Zhang et al. Research on keyword extraction of Word2vec model in Chinese corpus
CN110347833B (en) Classification method for multi-round conversations
WO2023207566A1 (en) Voice room quality assessment method, apparatus, and device, medium, and product
CN114722832A (en) Abstract extraction method, device, equipment and storage medium
CN112632272A (en) Microblog emotion classification method and system based on syntactic analysis
CN111460114A (en) Retrieval method, device, equipment and computer readable storage medium
CN116978367A (en) Speech recognition method, device, electronic equipment and storage medium
TW202034207A (en) Dialogue system using intention detection ensemble learning and method thereof
CN113095073B (en) Corpus tag generation method and device, computer equipment and storage medium
CN112989001B (en) Question and answer processing method and device, medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination