CN116687343A - Pronunciation assessment method, device and equipment for assisting diagnosis of sounding lesions

Pronunciation assessment method, device and equipment for assisting diagnosis of sounding lesions

Info

Publication number
CN116687343A
Authority
CN
China
Prior art keywords
pronunciation
sounding
data
evaluated
formant
Prior art date
Legal status
Pending
Application number
CN202210189721.6A
Other languages
Chinese (zh)
Inventor
康迂勇
肖玮
商世东
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202210189721.6A
Publication of CN116687343A


Classifications

    • A HUMAN NECESSITIES
      • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
        • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
          • A61B5/00 Measuring for diagnostic purposes; Identification of persons
            • A61B5/48 Other medical applications
              • A61B5/4803 Speech analysis specially adapted for diagnostic purposes
            • A61B5/74 Details of notification to user or communication with user or patient; user input means
              • A61B5/742 Details of notification to user or communication with user or patient; user input means using visual displays
                • A61B5/7425 Displaying combinations of multiple images regardless of image source, e.g. displaying a reference anatomical image with a live image
                • A61B5/743 Displaying an image simultaneously with additional graphical information, e.g. symbols, charts, function plots
    • G PHYSICS
      • G10 MUSICAL INSTRUMENTS; ACOUSTICS
        • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
          • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
            • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
              • G10L25/15 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being formant information
            • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
            • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
              • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
                • G10L25/66 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
      • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
        • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
          • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
            • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Public Health (AREA)
  • General Health & Medical Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Veterinary Medicine (AREA)
  • Heart & Thoracic Surgery (AREA)
  • Medical Informatics (AREA)
  • Molecular Biology (AREA)
  • Surgery (AREA)
  • Pathology (AREA)
  • Biophysics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Epidemiology (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The application discloses a pronunciation assessment method, device, and equipment for assisting in the diagnosis of sounding lesions, and belongs to the field of computer technology. The method comprises the following steps: displaying a pronunciation assessment interface for assisting in the diagnosis of sounding lesions, in which visual prompt information of the voice to be evaluated is displayed; recording, in response to a recording operation, pronunciation data of a user account for the voice to be evaluated; and displaying an evaluation result interface for sounding lesion reference, in which an evaluation score of sounding quality is displayed, the evaluation score being used to evaluate how healthy the pronunciation data is compared with healthy pronunciation data, where the healthy pronunciation data is pronunciation data of a normal person for the voice to be evaluated. Because the evaluation score reflects the user's pronunciation quality, pronunciation assessment can be performed for the user and the diagnosis of sounding lesions can thereby be assisted. Neither medical equipment nor manual judgment is needed in this process, which improves pronunciation assessment efficiency.

Description

Pronunciation assessment method, device and equipment for assisting diagnosis of sounding lesions
Technical Field
The present application relates to the field of computer technologies, and in particular, to a pronunciation assessment method, apparatus, and device for assisting in diagnosis of a sounding lesion.
Background
Speech is the most convenient and rapid way for humans to exchange information. Factors such as disease (e.g., stroke or hearing impairment), underdevelopment of the vocal organs, and aging of the vocal organs affect the coordinated movement of the organs during phonation and therefore affect pronunciation quality; such cases may be diagnosed as sounding lesions.
In the related art, pronunciation assessment of a patient is performed by a speech rehabilitation physician. Specifically, the physician uses professional medical equipment to obtain dynamic information about the patient's vocal organs during phonation and, combining this with clinical experience, evaluates the patient's pronunciation quality to diagnose sounding lesions.
Assessing pronunciation quality in this way requires expensive medical equipment to collect the data, the collection procedure is complicated, and a speech rehabilitation physician must make a manual judgment, so pronunciation assessment efficiency is low.
Disclosure of Invention
The application provides a pronunciation assessment method, device, and equipment for assisting in the diagnosis of sounding lesions, which can improve pronunciation assessment efficiency. The technical solution is as follows:
according to an aspect of the present application, there is provided a pronunciation assessment method for assisting in the diagnosis of a sounding lesion, the method comprising:
Displaying a pronunciation assessment interface for assisting in the diagnosis of sounding lesions, wherein visual prompt information of the voice to be evaluated is displayed in the pronunciation assessment interface;
recording, in response to a recording operation, pronunciation data of a user account for the voice to be evaluated; and
displaying an evaluation result interface for sounding lesion reference, wherein an evaluation score of sounding quality is displayed in the evaluation result interface, the evaluation score is used to evaluate how healthy the pronunciation data is compared with healthy pronunciation data, and the healthy pronunciation data is pronunciation data of a normal person for the voice to be evaluated.
According to another aspect of the present application, there is provided a sound assessment apparatus for assisting in diagnosis of sound-emitting lesions, the apparatus comprising:
the display module is used for displaying a pronunciation assessment interface for assisting in diagnosis of the sounding lesions, and visual prompt information of the voice to be assessed is displayed in the pronunciation assessment interface;
the recording module is used for responding to the recording operation and recording pronunciation data of the user account aiming at the voice to be evaluated;
the display module is further configured to display an evaluation result interface for sounding lesion reference, wherein an evaluation score of sounding quality is displayed in the evaluation result interface, the evaluation score is used to evaluate how healthy the pronunciation data is compared with healthy pronunciation data, and the healthy pronunciation data is pronunciation data of a normal person for the voice to be evaluated.
In an alternative design, the speech to be evaluated includes vowels to be evaluated that correspond to formant characteristics; the apparatus further comprises:
the determining module is used for determining the formant characteristics to be evaluated of the pronunciation data;
the determining module is further configured to determine the evaluation score according to a degree of similarity between the formant feature to be evaluated and the healthy formant feature of the healthy pronunciation data;
the display module is used for displaying the evaluation result interface and displaying the evaluation score on the evaluation result interface.
In an alternative design, the determining module is configured to:
determining formants for each audio frame in the voicing data;
generating a formant histogram according to formants of each audio frame;
adjusting formants of each audio frame according to peaks in the formant histogram;
and determining the average value of the formants after the adjustment of each audio frame as the formant characteristics to be evaluated of the pronunciation data.
In an alternative design, the formant characteristics to be evaluated include formant 1 to be evaluated and formant 2 to be evaluated, and the healthy formant characteristics include healthy formant 1 and healthy formant 2; the determining module is used for:
Fitting ellipses according to the position points of the healthy formants 1 and 2 of the healthy pronunciation data in a coordinate system with a first coordinate axis being a formant 1 and a second coordinate axis being a formant 2;
determining the position points of the formant characteristics to be evaluated in the coordinate system according to the formants to be evaluated 1 and the formants to be evaluated 2;
determining a distance between a location point of the formant feature to be evaluated and a center of the ellipse;
and determining the evaluation score according to the distance, wherein the distance is inversely related to the evaluation score.
In an alternative design, the formant characteristics to be evaluated include formant 1 to be evaluated and formant 2 to be evaluated, and the healthy formant characteristics include healthy formant 1 and healthy formant 2; the determining module is used for:
according to formant 1 to be evaluated and formant 2 to be evaluated, determining a probability density function value of the formant feature to be evaluated through a Gaussian mixture model, wherein the parameters of the Gaussian mixture model are learned with an expectation-maximization algorithm from healthy formant 1 and healthy formant 2 of the healthy pronunciation data;
determining a limit maximum probability density function value of the Gaussian mixture model according to the sum of products of maximum probability density function values of the Gaussian mixture model under different Gaussian components and weights of the Gaussian components;
And determining the evaluation score according to the distance between the probability density function value of the formant characteristic to be evaluated and the limit maximum probability density function value, wherein the distance is inversely related to the evaluation score.
In an alternative design, the apparatus further comprises:
the acquisition module is used for responding to the recording operation, acquiring a face image of the user account and acquiring sounding airflow data of the user account aiming at the voice to be evaluated;
the determining module is used for determining the pronunciation data, the face image and the sounding airflow data together as pronunciation related data of the user account for the voice to be evaluated;
the display module is used for displaying the evaluation result interface and displaying the evaluation score in the evaluation result interface, wherein the evaluation score is used for evaluating the health degree of the pronunciation related data compared with health pronunciation related data, and the health pronunciation related data are pronunciation related data of a normal person aiming at the voice to be evaluated.
In an alternative design, the apparatus further comprises:
the extraction module is used for extracting formant feature vectors of the pronunciation data; extracting a face feature vector of the face image; extracting sound-producing airflow characteristic vectors of the sound-producing airflow data;
The fusion module is used for fusing the formant feature vector, the face feature vector and the feature vector of the sounding airflow feature vector to obtain a fusion feature vector;
the determination module is used for inputting the fusion feature vector into a scoring model to obtain the evaluation score, wherein the scoring model is trained on the difference between a predicted score and a true score of the healthy fusion feature vector corresponding to the healthy pronunciation-related data, the predicted score is obtained by the scoring model predicting on the healthy fusion feature vector, and the true score is obtained by manually annotating the healthy fusion feature vector;
the display module is used for displaying the evaluation result interface and displaying the evaluation score on the evaluation result interface.
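Illustratively, the following Python sketch shows one possible way to realize the fusion and scoring described above. The feature dimensions, the placeholder training data, and the use of scikit-learn's MLPRegressor are assumptions made for illustration only; the embodiment does not fix a particular scoring-model architecture.

import numpy as np
from sklearn.neural_network import MLPRegressor

def fuse(formant_vec, face_vec, airflow_vec):
    """Simple early fusion: concatenate the per-modality feature vectors."""
    return np.concatenate([formant_vec, face_vec, airflow_vec])

# Hypothetical training data: fusion vectors of healthy recordings with manually
# annotated ground-truth scores (0-100); dimensions 16/32/8 are assumed.
train_x = np.random.rand(200, 16 + 32 + 8)
train_y = np.random.uniform(60, 100, 200)

scoring_model = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=500)
scoring_model.fit(train_x, train_y)

# At assessment time: fuse the recorded modalities and predict the evaluation score.
sample = fuse(np.random.rand(16), np.random.rand(32), np.random.rand(8))
score = float(scoring_model.predict(sample.reshape(1, -1))[0])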
In an alternative design, the display module is configured to:
displaying a lesion prediction control on the evaluation result interface;
and responding to the lesion prediction operation triggered on the lesion prediction control, and displaying a lesion prediction result on the evaluation result interface, wherein the lesion prediction result is used for predicting whether sounding lesions appear.
In an alternative design, the display module is configured to:
displaying an assessment score threshold in the assessment results interface;
and displaying the lesion prediction control on the evaluation result interface in the case that the evaluation score is smaller than the evaluation score threshold value.
In an alternative design, the apparatus further comprises:
the acquisition module is used for responding to the lesion prediction operation and acquiring historical pronunciation data of the user account aiming at the voice to be evaluated;
the prediction module is used for predicting the lesion prediction result according to the difference between the historical pronunciation data and the pronunciation data;
and the display module is used for displaying the lesion prediction result on the evaluation result interface.
In an alternative design, the prediction module is configured to:
inputting the characteristics of the historical pronunciation data and the characteristics of the pronunciation data into a classification model to obtain the lesion prediction result;
the classification model is trained on the error between a predicted label and a true label, wherein the predicted label is obtained by predicting, from the features of first pronunciation data and the features of second pronunciation data, whether a sample user has a sounding lesion; the first pronunciation data is pronunciation data of the sample user in a first period, the second pronunciation data is pronunciation data of the sample user in a second period that follows the first period, and the true label reflects whether the sample user developed a sounding lesion between the first period and the second period.
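Illustratively, the following Python sketch shows one way such a classification model could be realized: the model sees the features of an earlier recording and a later recording of the same speaker and predicts whether a sounding lesion developed between the two periods. The logistic regression, the feature dimensions, and the placeholder data are assumptions for illustration; the embodiment does not specify a model family.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder sample data: first-period features, second-period features, and labels
# (1 = a sounding lesion appeared between the two periods, 0 = it did not).
first = np.random.rand(300, 4)     # e.g. per-vowel F1/F2 features of the first period
second = np.random.rand(300, 4)    # features of the second period
labels = np.random.randint(0, 2, 300)

clf = LogisticRegression().fit(np.hstack([first, second]), labels)

# Prediction for one user: compare historical and current pronunciation features.
hist_feat, curr_feat = np.random.rand(4), np.random.rand(4)
lesion_predicted = bool(clf.predict(np.hstack([hist_feat, curr_feat]).reshape(1, -1))[0])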
In an alternative design, the display module is configured to:
and displaying a sounding animation on the sounding evaluation interface, wherein the sounding animation is used for reflecting the dynamic state of the process that the human sounding organ emits the healthy sound corresponding to the voice to be evaluated.
In an alternative design, the display module is configured to:
and responding to the lesion prediction operation, and displaying a historical evaluation score of the user account for the voice to be evaluated on an evaluation result interface.
According to another aspect of the present application, there is provided a computer device comprising a processor and a memory having stored therein at least one instruction, at least one program, code set or instruction set loaded and executed by the processor to implement a pronunciation assessment method for assisting in the diagnosis of a sounding lesion as described in the above aspect.
According to another aspect of the present application, there is provided a computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by a processor to implement the pronunciation assessment method for assisting in the diagnosis of a sounding lesion as described in the above aspect.
According to another aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the pronunciation assessment method for assisting with the diagnosis of a sounding lesion provided in various optional implementations of the above aspects.
The technical scheme provided by the application has the beneficial effects that at least:
and acquiring pronunciation data of the user aiming at the voice to be evaluated by displaying a pronunciation evaluation interface, so that an evaluation score is displayed on an evaluation result interface. The evaluation score can reflect the pronunciation quality of the user, so that pronunciation evaluation can be carried out on the user, and further diagnosis of the assisted sounding lesion can be realized. In the process, the medical equipment and the manual judgment are not needed, and the pronunciation assessment efficiency is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is an F1-F2 distribution diagram of common vowels provided by an exemplary embodiment of the present application;
FIG. 2 is an F1-F2 distribution when normal persons read three vowels, provided by an exemplary embodiment of the present application;
FIG. 3 is a block diagram of a computer system provided in accordance with an exemplary embodiment of the present application;
FIG. 4 is an interface diagram of a process for performing pronunciation assessment, provided by an exemplary embodiment of the present application;
FIG. 5 is a flow chart of a pronunciation assessment method for assisting in the diagnosis of a sounding lesion provided by an exemplary embodiment of the present application;
FIG. 6 is a flow chart of a pronunciation assessment method for assisting in the diagnosis of a sounding lesion provided by an exemplary embodiment of the present application;
FIG. 7 is a schematic illustration of a pronunciation animation provided by an exemplary embodiment of the present application;
FIG. 8 is a schematic illustration of a fitted ellipse provided by an exemplary embodiment of the present application;
FIG. 9 is a schematic representation of an equiprobable density curve provided by an exemplary embodiment of the present application;
FIG. 10 is a schematic diagram of an assessment results interface provided by an exemplary embodiment of the present application;
FIG. 11 is a schematic diagram of an assessment results interface provided by an exemplary embodiment of the present application;
FIG. 12 is a schematic diagram of an assessment results interface provided by an exemplary embodiment of the present application;
FIG. 13 is a schematic diagram of a process for performing pronunciation assessment, provided by an exemplary embodiment of the present application;
FIG. 14 is a flow chart of a pronunciation assessment method for assisting in the diagnosis of a sounding lesion provided by an exemplary embodiment of the present application;
FIG. 15 is a schematic structural diagram of a pronunciation assessment apparatus for assisting in diagnosis of sounding lesions provided by an exemplary embodiment of the present application;
FIG. 16 is a schematic structural diagram of a pronunciation assessment apparatus for assisting in diagnosis of sounding lesions provided by an exemplary embodiment of the present application;
FIG. 17 is a schematic structural diagram of a pronunciation assessment apparatus for assisting in diagnosis of sounding lesions provided by an exemplary embodiment of the present application;
FIG. 18 is a schematic structural diagram of a pronunciation assessment apparatus for assisting in diagnosis of sounding lesions provided by an exemplary embodiment of the present application;
FIG. 19 is a schematic structural diagram of a pronunciation assessment apparatus for assisting in diagnosis of sounding lesions provided by an exemplary embodiment of the present application;
FIG. 20 is a schematic structural diagram of a terminal provided by an exemplary embodiment of the present application.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
First, the relevant background of the application will be described:
Speech is the most convenient and rapid way for humans to exchange information. Factors such as disease (e.g., stroke or hearing impairment), underdevelopment of the vocal organs, and organ aging affect the coordinated movement of the organs during phonation and therefore affect pronunciation quality. At present, there are a large number of patients who need targeted speech rehabilitation. Pronunciation assessment is an important step in guiding patients through speech rehabilitation training and is of high importance. Traditional pronunciation assessment has to be performed at a hospital or a professional voice rehabilitation institution, so a patient cannot keep track of his or her pronunciation state at any time and correct poor pronunciation promptly.
The requirements for pronunciation assessment can be summarized as:
(1) Users can perform pronunciation assessment on their own and keep track of their pronunciation state. A pronunciation assessment client may run in a terminal, and users can evaluate their pronunciation at any time through the client according to their own needs. Therefore, even when it is inconvenient to visit a professional institution for speech rehabilitation, targeted speech rehabilitation training can still be assisted according to the assessment results of the pronunciation assessment client.
(2) Clinicians can provide medical assistance through the pronunciation assessment client and carry out pronunciation screening of large populations. Specifically, the pronunciation assessment client can assist a doctor in medical diagnosis, allowing the doctor to understand the basic pronunciation state of a patient with dysarthria. Through pronunciation screening with the pronunciation assessment client, doctors can grasp the etiology and age distribution of a large number of abnormal pronunciation cases, understand the pronunciation health of the whole population, and guide medical research on dysarthria.
At present, most traditional pronunciation assessment work relies on speech rehabilitation physicians, who draw on extensive clinical experience and use specialized medical equipment to diagnose impaired patients. Because professional medical equipment is expensive and complicated to operate, the user must go to a professional institution at a specific time for traditional pronunciation assessment, and the result is easily affected by the physician's clinical experience, so pronunciation assessment efficiency is low. The present application therefore provides, in client form, a pronunciation assessment method for assisting in the diagnosis of sounding lesions, so that a user can flexibly perform pronunciation assessment at any time and keep track of his or her pronunciation state in a timely manner.
Introduction to Formants (F):
The airflow generated by the human lungs is shaped by the vocal tract and oral structures and finally forms speech. From a signals-and-systems perspective, the airflow can be regarded as the source signal and the vocal tract and oral structures as a filter. Speech content is generated by the resonance of the airflow in this oral filter, where different filter configurations (including mouth opening, tongue position, etc.) produce different pronunciation content. In speech, phonemes are the most basic units of pronunciation, and each phoneme corresponds to a similar filter configuration across different speakers. Formants are commonly used to characterize these different filters.
Speech information is mainly determined by the first two formants (F1, F2). F1 is mainly related to the degree of mouth opening: the larger F1 is, the wider the mouth is open. F2 is mainly related to the front-back position of the tongue: the larger F2 is, the more forward the tongue is. F1 and F2 can therefore be used as features for pronunciation assessment. Vowels are a type of phoneme; in contrast to consonants, vowels are sounds produced when the airflow passes through the mouth without obstruction. Illustratively, FIG. 1 is an F1-F2 distribution diagram of common vowels provided by an exemplary embodiment of the present application. As shown in FIG. 1, the F1-F2 distribution of Chinese vowels has three typical vertices: vertex 101, vertex 102, and vertex 103, where vertex 101 corresponds to the vowel a, vertex 102 corresponds to the vowel i, and vertex 103 corresponds to the vowel u. These three vowels have the following properties: a and u have similar F2 and the largest difference in F1, so they can be used to distinguish mouth opening during pronunciation; u and i have similar F1 and the largest difference in F2, so they can be used to distinguish tongue position during pronunciation.
FIG. 2 is an F1-F2 distribution when normal persons read three vowels, provided by an exemplary embodiment of the present application. As shown in FIG. 2, F1 and F2 of the pronunciations of the different vowels (a, u, i) form clearly separated point clusters in a coordinate system whose horizontal axis is F1 and whose vertical axis is F2. For example, points in range 201 correspond to the pronunciation of a, points in range 202 correspond to the pronunciation of i, and points in range 203 correspond to the pronunciation of u. The method provided by the embodiments of the present application fits, in this coordinate system, the clusters of normal persons' pronunciations of the vowels to be evaluated. During assessment, if the F1-F2 distribution of the pronunciation to be evaluated is close to that of normal persons' pronunciation, the pronunciation can be considered healthy; otherwise it can be considered unhealthy. Pronunciation assessment for the user is thereby achieved.
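Illustratively, the following Python sketch shows one common way to estimate F1 and F2 for a single voiced audio frame using LPC analysis. The embodiments do not specify how formants are computed, so the LPC order, the pre-emphasis coefficient, and the low-frequency threshold below are assumptions made for illustration.

import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def estimate_f1_f2(frame, sr, lpc_order=12):
    """Return (F1, F2) in Hz for one voiced audio frame, or (0.0, 0.0) if not found."""
    frame = frame * np.hamming(len(frame))           # window to reduce spectral leakage
    frame = lfilter([1.0, -0.97], [1.0], frame)      # pre-emphasis
    # Autocorrelation-method LPC: solve the Toeplitz normal equations.
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = solve_toeplitz((r[:lpc_order], r[:lpc_order]), r[1:lpc_order + 1])
    # Roots of the prediction polynomial 1 - a1*z^-1 - ... give the vocal-tract resonances.
    roots = np.roots(np.concatenate(([1.0], -a)))
    roots = roots[np.imag(roots) > 0]                # keep one root of each conjugate pair
    freqs = np.sort(np.angle(roots) * sr / (2 * np.pi))
    formants = [f for f in freqs if 90.0 < f < sr / 2]   # discard near-DC roots
    if len(formants) < 2:
        return 0.0, 0.0
    return formants[0], formants[1]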
FIG. 3 is a block diagram of a computer system provided in an exemplary embodiment of the application. The computer system 300 includes: a first terminal 310, a server 320.
The first terminal 310 has installed on it, and runs, a first client 313 that supports pronunciation assessment. When the first terminal 310 runs the first client 313, a user interface of the first client 313 is displayed on the screen of the first terminal 310. The first client 313 may be any one of a medical program, an assisted medical diagnosis program, a learning program, a daily training program, and an applet. The first terminal 310 is the terminal used by the first user 312, and a first user account of the first user 312 is logged in on the first client 313. The first terminal 310 may refer broadly to one of a plurality of terminals. Optionally, the device type of the first terminal 310 includes at least one of a smartphone, a tablet computer, an e-book reader, an MP3 player, an MP4 player, a laptop portable computer, and a desktop computer.
Only one terminal is shown in fig. 3, but in different embodiments there are a plurality of other terminals 330 that can access the server 320. Optionally, there are one or more terminals 330 corresponding to the developer, a development and editing platform for supporting the client for pronunciation assessment is installed on the terminals 330, the developer can edit and update the client on the terminals 330, and transmit the updated application installation package to the server 320 through a wired or wireless network, and the first terminal 310 can download the application installation package from the server 320 to implement the update of the client.
The first terminal 310 and the other terminals 330 are connected to the server 320 through a wireless network or a wired network.
Server 320 includes at least one of a server, a plurality of servers, a cloud computing platform, and a virtualization center. The server 320 is configured to provide a background service for a client supporting instant messaging. Optionally, the server 320 takes on primary computing work and the terminal takes on secondary computing work; alternatively, the server 320 takes on secondary computing work and the terminal takes on primary computing work; alternatively, a distributed computing architecture is used for collaborative computing between the server 320 and the terminals.
In one illustrative example, the server 320 includes a processor 322, a user account database 323, a pronunciation assessment service module 324, and a user-oriented input/output interface (I/O Interface) 325. The processor 322 is configured to load instructions stored in the server 320 and to process data in the user account database 323 and the pronunciation assessment service module 324; the user account database 323 is used to store data of the user accounts used by the first terminal 310 and the other terminals 330, such as the avatar of a user account, the nickname of a user account, and the group in which a user account is located; the pronunciation assessment service module 324 is configured to implement functions related to pronunciation assessment; and the user-oriented I/O interface 325 is used to establish communication and data exchange with the first terminal 310 through a wireless network or a wired network.
FIG. 4 is an interface diagram of a process for performing pronunciation assessment according to an exemplary embodiment of the present application. As shown in (a) of FIG. 4, the client displays a pronunciation assessment interface 401 in which an assessment control 402 is displayed. The pronunciation assessment interface 401 also displays instructions and precautions for the assessment process, as well as preview controls for triggering playback of demonstration pronunciation data for the different vowels to be evaluated in the pronunciation assessment.
As shown in (b) of FIG. 4, in response to an assessment operation on the assessment control 402, the client displays visual prompt information 403 of a vowel to be evaluated on the pronunciation assessment interface 401. Optionally, the client also displays a play control 404 and a record control 405 on the pronunciation assessment interface 401. The play control 404 is used to trigger playback of the demonstration pronunciation data corresponding to the vowel to be evaluated, and the record control is used to record the pronunciation data of the user account for the vowel to be evaluated. When pronunciation assessment is performed for several vowels to be evaluated, an assessment progress indicator is also displayed in the interface to reflect the position of the vowel shown in the current interface among all the vowels to be evaluated. The interface also displays a re-record control and a switch control, so that pronunciation data can be collected again or recording can be switched to the next vowel to be evaluated.
As shown in (c) of FIG. 4, after the pronunciation data of one vowel to be evaluated has been collected, the client switches to the pronunciation assessment interface 401 corresponding to the next vowel to be evaluated, until the pronunciation data of all the vowels to be evaluated have been collected.
As shown in (d) of FIG. 4, the client displays an evaluation score 407 in the displayed evaluation result interface 406. The evaluation score 407 is used to evaluate how healthy the pronunciation data is compared with healthy pronunciation data, which is pronunciation data of normal persons for the vowels to be evaluated. When there are several vowels to be evaluated, the evaluation score 407 is determined from the mean or weighted mean of the scores corresponding to the different vowels. Optionally, the client collects the pronunciation data corresponding to each vowel to be evaluated and performs pronunciation assessment after all of it has been collected; alternatively, the client performs pronunciation assessment each time the pronunciation data of one vowel to be evaluated has been collected, and then aggregates the assessment results of the individual vowels to obtain the final evaluation score 407.
By displaying the pronunciation assessment interface, pronunciation data of the user for the voice to be evaluated is collected, and an evaluation score is then displayed on the evaluation result interface. Because the evaluation score reflects the user's pronunciation quality, pronunciation assessment can be performed for the user and the diagnosis of sounding lesions can thereby be assisted. Neither medical equipment nor manual judgment is needed in this process, which improves pronunciation assessment efficiency.
Fig. 5 is a flow chart of a pronunciation assessment method for assisting in diagnosis of a sounding lesion according to an exemplary embodiment of the present application. The method may be used for a terminal or a client on a terminal in a system as shown in fig. 3. As shown in fig. 5, the method includes:
step 502: a pronunciation assessment interface for assisting in the diagnosis of a sounding lesion is displayed.
The pronunciation assessment interface is a user interface in the client for triggering pronunciation assessment. Optionally, the client starts to execute the flow of pronunciation assessment when starting to display the pronunciation assessment interface. Or the client starts to execute the flow of pronunciation assessment according to the pronunciation assessment operation triggered in the pronunciation assessment interface. For example, an evaluation control is displayed in the pronunciation evaluation interface, and the evaluation control is used for triggering the client to start executing the pronunciation evaluation process. Optionally, in the case that the evaluation control is displayed in the pronunciation evaluation interface, the explanation of the evaluation process and the attention information are also displayed in the pronunciation evaluation interface for prompting the user.
Visual prompt information of the voice to be evaluated is displayed in the pronunciation assessment interface. The voice to be evaluated is the speech used when performing pronunciation assessment, such as a word, a phoneme, or a vowel. The client performs pronunciation assessment on the user according to the user's pronunciation of the voice to be evaluated. Optionally, the voice to be evaluated is set by the developer of the client. For example, when the language of the pronunciation assessment is Chinese, the voice to be evaluated consists of the vowels to be evaluated, including a, u, and i. When the language of the pronunciation assessment is another language, the voice to be evaluated is different; for example, for each language, speech belonging to that language is set. The present application does not limit the voice to be evaluated. Before the pronunciation assessment process starts, and/or during the pronunciation assessment process, the client displays the visual prompt information of the voice to be evaluated on the pronunciation assessment interface.
Optionally, one pronunciation assessment involves one or more voices to be evaluated. When there are several voices to be evaluated, only the visual prompt information of the voice currently being evaluated is displayed in the pronunciation assessment interface. Illustratively, with continued reference to FIG. 2, when evaluating the pronunciation of "a", the client displays "o" in the pronunciation assessment interface; when evaluating the pronunciation of "u", the client displays "Wu" in the pronunciation assessment interface.
Optionally, when the voice to be evaluated is a vowel to be evaluated, for ease of understanding the client displays the visual prompt information using a character whose pronunciation is the same as that of the vowel to be evaluated. For example, when the vowel to be evaluated is "a", "o" is displayed as the visual prompt; when the vowel to be evaluated is "u", "Wu" is displayed; and when the vowel to be evaluated is "i", a character pronounced "i" is displayed. In addition, the client may select other characters with the same pronunciation as the vowel to be evaluated as the visual prompt information, which the present application does not limit.
Step 504: and responding to the recording operation, and recording pronunciation data of the user account aiming at the voice to be evaluated.
After the client starts the pronunciation assessment process, it records pronunciation data of the user account for the voice to be evaluated. The client collects the pronunciation data through a sound-collecting element on the terminal, for example a microphone. Optionally, the recording operation is triggered automatically by the client when the pronunciation assessment process starts, that is, the client automatically starts recording pronunciation data at the start of the assessment process; alternatively, after the pronunciation assessment process starts, a record control is displayed on the pronunciation assessment interface and is used to trigger the recording operation.
After the client starts the pronunciation assessment process, it displays a play control in the pronunciation assessment interface, which is used to trigger playback of the demonstration pronunciation data corresponding to the voice to be evaluated. The client may also automatically play the demonstration pronunciation data corresponding to the voice to be evaluated before each recording. Optionally, the client also displays a re-record control on the pronunciation assessment interface, so that pronunciation data can be collected again as the user requires. When one pronunciation assessment involves several voices to be evaluated, the client also displays a switch control on the pronunciation assessment interface, which is used to switch the recording to a different voice to be evaluated. The client may also automatically switch to recording the pronunciation data of the next voice to be evaluated after the pronunciation data of the current one has been recorded.
Optionally, in addition to recording pronunciation data of the user account for the voice to be evaluated, the client may also acquire a face image of the user account while the user pronounces the voice to be evaluated, and/or acquire sounding airflow data of the user account for the voice to be evaluated. The client then performs pronunciation assessment using all the collected data.
Step 506: and displaying an evaluation result interface for sounding lesion reference, wherein an evaluation score of sounding quality is displayed in the evaluation result interface.
The evaluation score is used to evaluate how healthy the pronunciation data is compared with healthy pronunciation data. Optionally, the evaluation score is determined according to the degree of similarity between the pronunciation data and the healthy pronunciation data corresponding to the voice to be evaluated: the more similar the pronunciation data is to the healthy pronunciation data, the higher the evaluation score, and the less similar, the lower the score. The healthy pronunciation data is pronunciation data of normal persons for the voice to be evaluated; it is obtained by collecting the pronunciation of normal persons for that voice, where a normal person is chosen at random from the population or is a person determined to have no sounding lesion.
When there are several voices to be evaluated, the evaluation score is determined from the mean or weighted mean of the scores corresponding to the different voices. Optionally, the client collects the pronunciation data corresponding to each voice to be evaluated and performs pronunciation assessment after all of it has been collected; alternatively, the client performs pronunciation assessment each time the pronunciation data of one voice to be evaluated has been collected, and then aggregates the assessment results of the individual voices to obtain the final evaluation score.
Optionally, a submit control is also displayed in the pronunciation assessment interface and is used to signal that the recording of pronunciation data is complete. In response to a submit operation on the submit control, the client displays the evaluation result interface; alternatively, the client automatically jumps to the evaluation result interface after the pronunciation data corresponding to each voice to be evaluated has been recorded.
In summary, in the method provided by this embodiment, pronunciation data of the user for the voice to be evaluated is collected by displaying a pronunciation assessment interface, and an evaluation score is then displayed on an evaluation result interface. Because the evaluation score reflects the user's pronunciation quality, pronunciation assessment can be performed for the user and the diagnosis of sounding lesions can thereby be assisted. Neither medical equipment nor manual judgment is needed in this process, which improves pronunciation assessment efficiency.
Optionally, the client can perform pronunciation assessment based on the pronunciation data of the user account for the voice to be evaluated. The client can also perform pronunciation assessment based on pronunciation-related data of the user account for the voice to be evaluated, where the pronunciation-related data includes the pronunciation data of the user account for the voice to be evaluated, the face image of the user account while pronouncing the voice to be evaluated, and the sounding airflow data of the user account for the voice to be evaluated.
1. For the case of pronunciation assessment based on pronunciation data:
fig. 6 is a flow chart of a pronunciation assessment method for assisting in diagnosis of a sounding lesion according to an exemplary embodiment of the present application. The method may be used for a terminal or a client on a terminal in a system as shown in fig. 3. As shown in fig. 6, the method includes:
step 602: a pronunciation assessment interface for assisting in the diagnosis of a sounding lesion is displayed.
The pronunciation assessment interface is a user interface in the client for triggering pronunciation assessment. And displaying visual prompt information of the voice to be evaluated in the pronunciation evaluation interface. And the client side realizes pronunciation assessment on the user according to the pronunciation of the user aiming at the voice to be assessed. Optionally, for a pronunciation assessment, there are one or more voices to be assessed.
Step 604: and responding to the recording operation, and recording pronunciation data of the user account aiming at the voice to be evaluated.
After the client starts the pronunciation assessment process, it records pronunciation data of the user account for the voice to be evaluated. The client collects the pronunciation data through the sound-collecting element on the terminal. Optionally, the recording duration of the pronunciation data is limited, and the recording duration is determined by the client.
Optionally, after starting the pronunciation assessment process, the client displays a pronunciation animation on the pronunciation assessment interface. The pronunciation animation reflects the dynamics of the process by which the human vocal organs produce the healthy sound corresponding to the voice to be evaluated, and the animation displayed in the pronunciation assessment interface corresponds to the voice to be evaluated for that interface. Optionally, the client displays the pronunciation animation automatically before recording the pronunciation data, or displays it when a control in the pronunciation assessment interface is triggered.
Illustratively, FIG. 7 is a schematic illustration of a pronunciation animation provided by an exemplary embodiment of the present application. As shown in FIG. 7, the client displays a pronunciation assessment interface 701 that is used to record the pronunciation data corresponding to "a". In the pronunciation assessment interface 701, the client displays a pronunciation animation 702, which reflects the dynamics of the process by which the human vocal organs produce a healthy "a" sound.
Step 606: the formant characteristics to be evaluated of the pronunciation data are determined.
Optionally, the voice to be evaluated includes vowels to be evaluated, which correspond to formant features. The formant feature to be evaluated is used for pronunciation assessment and is determined from the formants of the audio frames of the pronunciation data. Optionally, the formants include formant 1 (F1) and formant 2 (F2). After recording the pronunciation data, the client determines the formants of each audio frame in the pronunciation data and generates a formant histogram from them, where the horizontal axis of the histogram is the audio frames arranged in time order and the vertical axis is the formants of the audio frames. The client adjusts the formants of each audio frame according to the peaks in the formant histogram, for example by dynamic programming of the frame formants with respect to the peaks, which includes adjusting outliers, smoothing, and the like. After the formants of each audio frame have been adjusted, the client determines the mean of the adjusted formants as the formant feature to be evaluated of the pronunciation data. Optionally, formants with a value of 0 are excluded when the mean is calculated.
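Illustratively, the following Python sketch shows one simplified realization of this aggregation step, assuming the per-frame formants have already been estimated (for example with the LPC sketch given earlier). The histogram bin count and the clipping-based outlier adjustment, which stands in for the dynamic-programming adjustment mentioned above, are assumptions made for illustration.

import numpy as np

def formant_feature(per_frame_formants, n_bins=40, tolerance=300.0):
    """per_frame_formants: array of shape (n_frames, 2) holding (F1, F2) per audio frame.
    Returns the [F1, F2] formant feature to be evaluated for the recording."""
    per_frame = np.asarray(per_frame_formants, dtype=float)
    feature = []
    for k in range(2):                                   # F1 first, then F2
        values = per_frame[:, k]
        values = values[values > 0]                      # formants with value 0 are excluded
        if values.size == 0:
            feature.append(0.0)
            continue
        hist, edges = np.histogram(values, bins=n_bins)
        peak_bin = np.argmax(hist)
        peak = 0.5 * (edges[peak_bin] + edges[peak_bin + 1])   # histogram peak (Hz)
        # "Adjust" outlying frames: pull values far from the peak back toward it.
        adjusted = np.clip(values, peak - tolerance, peak + tolerance)
        feature.append(float(np.mean(adjusted)))         # mean of the adjusted formants
    return feature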
Step 608: and determining an evaluation score according to the similarity degree between the formant characteristics to be evaluated and the health formant characteristics of the health pronunciation data.
The process by which the client determines the healthy formant features of the healthy pronunciation data is the same as, or different from, the process of determining the formant feature to be evaluated. The more similar the pronunciation data is to the healthy pronunciation data, the higher the evaluation score; the less similar, the lower the score. Optionally, the client can determine the evaluation score in the following ways; the client can also determine it in other ways, which the present application does not limit. In addition, when the client determines the evaluation score in several ways, it can take the mean or weighted mean of the scores determined by the individual ways as the final evaluation score.
Determining an evaluation score by fitting an ellipse:
Optionally, the formant features to be evaluated include formant 1 to be evaluated and formant 2 to be evaluated, and the healthy formant features include healthy formant 1 and healthy formant 2. In a coordinate system whose first coordinate axis is formant 1 and whose second coordinate axis is formant 2, the client fits an ellipse to the location points of healthy formant 1 and healthy formant 2 of multiple pieces of healthy pronunciation data. For example, the client can determine the boundary of the distribution range of these location points and then fit the ellipse to that boundary, e.g., through multiple linear regression. The client determines the location point of the formant feature to be evaluated in the coordinate system according to formant 1 to be evaluated and formant 2 to be evaluated, determines the distance between that location point and the center of the ellipse, and then determines the evaluation score from the distance, the distance being inversely related to the evaluation score.
For example, after fitting the ellipse and determining the location point of the formant feature to be evaluated in the coordinate system, the client determines the vector r from the center of the ellipse to that location point, as well as the semi-major axis a and the semi-minor axis b of the ellipse. The client then calculates the projection |ra| of the vector onto the major-axis direction and the projection |rb| onto the minor-axis direction; |ra| and |rb| represent the distance between the location point and the center of the ellipse. The evaluation score is then calculated from |ra| and |rb| together with a and b, such that the farther the location point lies from the center of the ellipse, the lower the score.
Optionally, the client can also fit a circle in the same way and determine the evaluation score according to the distance between the location point corresponding to the formant feature to be evaluated and the center of the circle. In practical applications, however, fitting an ellipse is more accurate than fitting a circle.
Illustratively, FIG. 8 is a schematic illustration of a fitted ellipse provided by an exemplary embodiment of the present application. As shown in fig. 8, in the F1-F2 coordinate system, the client fits an ellipse 801 from healthy pronunciation data for vowel a, fits an ellipse 802 from healthy pronunciation data for vowel i, and fits an ellipse 803 from healthy pronunciation data for vowel u. The client may then determine an evaluation score for vowel a based on the distance between the location in the coordinate system corresponding to the pronunciation data for vowel a for the user account and the center of the ellipse 801. An evaluation score for vowel i may be determined based on the distance of the corresponding location of the user account's pronunciation data for vowel i in the coordinate system from the center of the ellipse 802. An evaluation score for vowel u may be determined based on the distance of the corresponding location of the user account's pronunciation data for vowel u in the coordinate system from the center of ellipse 803.
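Illustratively, the following Python sketch shows the ellipse-based scoring idea. A covariance ellipse is used as a simple stand-in for the boundary-regression fit described above, and the mapping from the normalized distance to a 0-100 score is an assumed one, not the formula of the embodiment.

import numpy as np

def fit_ellipse(healthy_f1, healthy_f2, n_std=2.0):
    """Return (center, axis_dirs, semi_axes) of a covariance ellipse over healthy points."""
    pts = np.column_stack([healthy_f1, healthy_f2])
    center = pts.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(pts, rowvar=False))
    return center, eigvecs, n_std * np.sqrt(eigvals)      # semi-axes along the two principal axes

def ellipse_score(f1, f2, center, axis_dirs, semi_axes):
    """Map the point's normalized offset from the ellipse center to a 0-100 score."""
    r = np.array([f1, f2]) - center
    ra, rb = axis_dirs.T @ r                               # projections onto the two principal axes
    # Normalized radial distance: 1.0 means the point lies on the ellipse boundary.
    dist = np.sqrt((ra / semi_axes[0]) ** 2 + (rb / semi_axes[1]) ** 2)
    return max(0.0, 100.0 - 50.0 * max(dist - 1.0, 0.0))  # assumed score mapping for illustration

# Usage: fit on healthy (F1, F2) samples of one vowel, then score the user's feature.
center, dirs, axes = fit_ellipse(np.random.normal(800, 60, 50), np.random.normal(1200, 80, 50))
print(ellipse_score(820, 1180, center, dirs, axes))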
An evaluation score is determined by means of a gaussian mixture model (Gaussian Mixture Model, GMM):
optionally, the formant characteristics to be evaluated include a formant 1 to be evaluated and a formant 2 to be evaluated, and the healthy formant characteristics include a healthy formant 1 and a healthy formant 2. The client can determine the probability density function value of the formant characteristics to be evaluated through a Gaussian mixture model according to the formant 1 to be evaluated and the formant 2 to be evaluated. The parameters of the Gaussian mixture model are learned through the expectation-maximization (EM) algorithm from the healthy formants 1 and healthy formants 2 of a plurality of pieces of healthy pronunciation data, and the parameters include the mean, variance, and weight corresponding to each Gaussian component. The client can determine the limiting maximum probability density function value of the Gaussian mixture model as the sum of the products of the maximum probability density function values of the Gaussian mixture model under the different Gaussian components and the weights of those Gaussian components. An evaluation score is then determined according to the distance between the probability density function value of the formant characteristics to be evaluated and the limiting maximum probability density function value, the distance being inversely related to the evaluation score.
Illustratively, when calculating the distance, the client needs to determine the limiting maximum probability density function value pdf_max1 of the GMM. In practice, this value is not necessarily attained by the probability density function of the actual GMM; it is attained only when the GMM has a single Gaussian component. Optionally, the vowels to be evaluated and the GMMs are in one-to-one correspondence, i.e., each GMM is used to process the pronunciation data of its corresponding vowel to be evaluated. The client calculates the limiting maximum probability density function value of the GMM through the following formula:
pdf_max1 = w(1)*pdf(1)_max + w(2)*pdf(2)_max + ... + w(N)*pdf(N)_max;
where N is the number of Gaussian components, w(i) is the weight of the ith Gaussian component, and pdf(i)_max is the maximum probability density function value of the ith Gaussian component.
The client side calculates the distance through the following formula:
dis = sqrt((-2)*[ln(pdf) - ln(pdf_max1)]);
where pdf is the probability density function value of the formant characteristics to be evaluated, determined by the Gaussian mixture model.
The client side calculates the evaluation score through the following formula:
score = max(100 - 10*dis, 0).
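A minimal sketch of this GMM-based scoring, assuming scikit-learn's GaussianMixture as the EM implementation; the peak density of a two-dimensional Gaussian component is 1/(2*pi*sqrt(det(cov))), and the function names are illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_healthy_gmm(healthy_f1_f2, n_components=3):
    """Fit a GMM on healthy (F1, F2) pairs with the EM algorithm."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="full",
                          random_state=0)
    gmm.fit(np.asarray(healthy_f1_f2, dtype=float))
    return gmm

def gmm_score(gmm, f1, f2):
    """Evaluation score of one (F1, F2) point, following the formulas above."""
    # Peak density of the ith 2-D Gaussian component: 1/(2*pi*sqrt(det(cov_i)))
    peak_pdfs = np.array([1.0 / (2.0 * np.pi * np.sqrt(np.linalg.det(cov)))
                          for cov in gmm.covariances_])
    # pdf_max1 = sum over i of w(i) * pdf(i)_max
    pdf_max1 = float(np.dot(gmm.weights_, peak_pdfs))
    # Natural-log probability density of the formant feature to be evaluated
    log_pdf = gmm.score_samples(np.array([[f1, f2]]))[0]
    # dis = sqrt((-2) * [ln(pdf) - ln(pdf_max1)])
    dis = np.sqrt(-2.0 * (log_pdf - np.log(pdf_max1)))
    # score = max(100 - 10*dis, 0)
    return max(100.0 - 10.0 * dis, 0.0)
```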
illustratively, FIG. 9 is a schematic diagram of equiprobable density curves provided by an exemplary embodiment of the present application. As shown in fig. 9, in the F1-F2 coordinate system, the client can determine the equiprobable density curve 901 through a GMM from the healthy pronunciation data for vowel a, the equiprobable density curve 902 through a GMM from the healthy pronunciation data for vowel i, and the equiprobable density curve 903 through a GMM from the healthy pronunciation data for vowel u. Then, according to the formant characteristics to be evaluated of the pronunciation data for the different vowels to be evaluated, the probability density function values corresponding to those pronunciation data can be determined through the GMMs; the closer a value lies to the inside of the corresponding curve, the higher the evaluation score.
Step 610: and displaying an evaluation result interface for sounding lesion reference, wherein an evaluation score of sounding quality is displayed in the evaluation result interface.
The evaluation score is used for evaluating the health degree of the pronunciation data compared with the healthy pronunciation data, where the healthy pronunciation data is pronunciation data of a normal person for the voice to be evaluated. In the case that a plurality of voices to be evaluated exist, the evaluation score is determined as the average or weighted average of the scores corresponding to the different voices to be evaluated. Optionally, the client may also display a level of the evaluation score, the level being determined according to the magnitude relationship between the evaluation score and one or more score thresholds. The thresholds are set manually, for example according to a number of clinical experiments. The levels of the evaluation score include, for example, good, medium, and poor. In addition, in the case that a plurality of voices to be evaluated exist, the client can also display the evaluation score corresponding to each voice to be evaluated.
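A minimal sketch of the score aggregation and level assignment described above; the level names and threshold values are purely illustrative assumptions:

```python
def overall_score(per_speech_scores, weights=None):
    """Overall evaluation score: average (or weighted average) of the scores
    corresponding to the different voices to be evaluated."""
    if weights is None:
        return sum(per_speech_scores) / len(per_speech_scores)
    return sum(w * s for w, s in zip(weights, per_speech_scores)) / sum(weights)

def score_level(score, thresholds=(85, 60)):
    """Map a score to a level via manually set thresholds (the values here are
    illustrative, not taken from the embodiment)."""
    if score >= thresholds[0]:
        return "good"
    if score >= thresholds[1]:
        return "medium"
    return "poor"
```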
Illustratively, FIG. 10 is a schematic diagram of an assessment results interface provided by an exemplary embodiment of the present application. As shown in fig. 10, the client displays an evaluation result interface 1001, and the evaluation result interface 1001 displays an overall evaluation score 1002, and an evaluation score 1003 corresponding to each speech to be evaluated displayed by means of a radar chart, and a level 1004 of the evaluation score.
Step 612: and displaying a lesion prediction control on the evaluation result interface.
The lesion prediction control is used for triggering the client to predict whether sounding lesions appear according to sounding data of the user account. Optionally, the client displays an evaluation score threshold in the evaluation result interface, and in the case that the evaluation score is smaller than the evaluation score threshold, the client displays a lesion prediction control in the evaluation result interface. The evaluation score threshold is manually set. Optionally, for different voices to be evaluated, corresponding evaluation score thresholds are set. Alternatively, the same evaluation score threshold is set for different voices to be evaluated.
Illustratively, FIG. 11 is a schematic illustration of an assessment results interface provided by an exemplary embodiment of the present application. As shown in fig. 11, the client displays an evaluation result interface 1101 in which an evaluation score 1102 and an evaluation score threshold 1103 are displayed. In the event that the evaluation score 1102 is less than the evaluation score threshold 1103, the client will display a lesion prediction control 1104 at the evaluation results interface.
Step 614: and responding to the lesion prediction operation triggered on the lesion prediction control, and displaying a lesion prediction result on the evaluation result interface.
The lesion prediction result is used to predict whether a sounding lesion is present, and is used only as a reference, not as a result of medical diagnosis. Alternatively, in the case where the evaluation score is smaller than the evaluation score threshold, the client can also skip displaying the lesion prediction control and directly display the evaluation score and the lesion prediction result.
Optionally, in response to the lesion prediction operation, the client may further display a historical evaluation score of the user account for the voice to be evaluated on the evaluation result interface, so that the user can know the past pronunciation quality condition and the variation condition of the pronunciation quality for the voice to be evaluated.
Illustratively, FIG. 12 is a schematic diagram of an assessment results interface provided by an exemplary embodiment of the present application. As shown in fig. 12, the client displays an evaluation result interface 1201, and displays a lesion prediction result 1202 and a corresponding prompt message on the evaluation result interface 1201. The client also displays a historical evaluation score 1203 of the user account for the voice to be evaluated in a line graph manner on the evaluation result interface 1201.
Optionally, in response to the lesion prediction operation, the client may obtain historical pronunciation data of the user account for the voice to be evaluated. Then, according to the difference between the historical pronunciation data and the pronunciation data of the user account for the voice to be evaluated, the client predicts the lesion prediction result, so that the lesion prediction result can be displayed on the evaluation result interface.
The client can obtain the lesion prediction result by inputting the characteristics of the historical pronunciation data and the characteristics of the pronunciation data into a classification model. Optionally, the characteristics of the pronunciation data are the characteristics of the audio frames of the pronunciation data, or the formant characteristics corresponding to the pronunciation data. The classification model is trained on the error between prediction labels and true labels and is implemented based on a neural network (NN). A prediction label is obtained by the classification model predicting, according to the characteristics of first pronunciation data and the characteristics of second pronunciation data, whether a sample user has a sounding lesion. The first pronunciation data is pronunciation data of the sample user in a first period, the second pronunciation data is pronunciation data of the sample user in a second period after the first period, and the true label reflects whether a sounding lesion of the sample user occurred between the first period and the second period. Optionally, the sample users are selected randomly from the general population, or from people with sounding lesions, and the true labels are judged manually. The first period and the second period are selected randomly; alternatively, when a sounding lesion is determined to have occurred, the first period is selected before the time when the sounding lesion is determined to have occurred, and the second period is selected after that time.
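A minimal sketch of such a classification model, using scikit-learn's MLPClassifier as a stand-in for the neural network; the simple concatenation of first-period and second-period features and all function names are assumptions:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_lesion_classifier(first_period_feats, second_period_feats, labels):
    """Train the lesion classifier.
    first_period_feats / second_period_feats: arrays of shape (n_samples, d)
    holding the features of the sample users' pronunciation data in the first
    and second periods; labels: 1 if a sounding lesion appeared between the
    two periods (judged manually), otherwise 0."""
    X = np.concatenate([first_period_feats, second_period_feats], axis=1)
    clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500,
                        random_state=0)
    clf.fit(X, labels)
    return clf

def predict_lesion(clf, historical_feats, current_feats):
    """Predict whether a sounding lesion appears from the historical and
    current pronunciation features of the user account."""
    x = np.concatenate([historical_feats, current_feats]).reshape(1, -1)
    return int(clf.predict(x)[0])      # 1 means a sounding lesion is predicted
```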
Optionally, in order to further improve the accuracy of the model, different classification models can also be trained for different voices to be evaluated. In this case, the first pronunciation data and the second pronunciation data are both pronunciation data for the voice to be evaluated corresponding to the classification model. When making a prediction, the client uses, for each piece of pronunciation data, the classification model corresponding to its voice to be evaluated. When the prediction results for different voices to be evaluated differ, the client determines the prediction result that occurs most often as the lesion prediction result. Alternatively, if any prediction result indicates a sounding lesion, the client determines that the lesion prediction result is a sounding lesion. Alternatively, the lesion prediction result is determined manually according to the prediction result corresponding to each voice to be evaluated.
Illustratively, FIG. 13 is a schematic diagram of a process for performing pronunciation assessment as provided by an exemplary embodiment of the present application. As shown in fig. 13, in step S1, the client displays the pronunciation assessment interface, starts the pronunciation assessment, and initializes the test count to 0; the required number of tests is determined according to the number of voices to be evaluated corresponding to the pronunciation assessment. In step S2, the client plays the reference audio of the currently evaluated voice to be evaluated. In step S3, the client collects the pronunciation data of the user for the currently evaluated voice to be evaluated. In step S4, the client performs pronunciation assessment according to the collected pronunciation data. In step S5, the client increments the test count by 1. In step S6, the client determines whether the test count has reached the required number. If it has not, the process jumps to step S2; if it has, the process proceeds to step S7. In step S7, the client displays the evaluation result interface and displays the evaluation score.
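A minimal sketch of the S1-S7 flow of fig. 13; the client.* methods are hypothetical placeholders for the interface display, reference playback, recording, and scoring steps described above:

```python
def run_pronunciation_assessment(speeches_to_evaluate, client):
    """One reference-playback / recording / scoring round per voice to be
    evaluated (S2-S6), then display of the evaluation result interface (S7)."""
    client.show_assessment_interface()                         # S1
    scores = []
    for speech in speeches_to_evaluate:                        # test count loop
        client.play_reference_audio(speech)                    # S2
        pronunciation = client.record_pronunciation(speech)    # S3
        scores.append(client.evaluate(speech, pronunciation))  # S4, S5/S6
    client.show_result_interface(scores)                       # S7
    return scores
```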
In summary, according to the method provided by this embodiment, the pronunciation data of the user for the voice to be evaluated is collected by displaying the pronunciation assessment interface, and the evaluation score is then displayed on the evaluation result interface. The evaluation score reflects the pronunciation quality of the user, so that pronunciation assessment can be performed for the user and the diagnosis of sounding lesions can thereby be assisted. This process requires neither medical equipment nor manual judgment, which improves the efficiency of pronunciation assessment.
The method provided by this embodiment also performs the pronunciation assessment using the formant characteristics of the pronunciation data. This approach is computationally efficient and can complete the pronunciation assessment quickly, further improving the efficiency of pronunciation assessment.
According to the method provided by the embodiment, the formant characteristics of the pronunciation data are determined by using the histogram technology, so that the accuracy of the determined formant characteristics can be improved, and the accuracy of pronunciation assessment is improved.
The method provided by this embodiment further determines the evaluation score by fitting an ellipse. The method is high in operation efficiency, high in accuracy and capable of rapidly determining the evaluation score.
The method provided by the present embodiment also determines an evaluation score by using a gaussian mixture model. The method is high in operation efficiency, high in accuracy and capable of rapidly determining the evaluation score.
The method provided by this embodiment also predicts whether a sounding lesion appears in the user and displays the lesion prediction result, providing valuable reference information for the user.
According to the method provided by this embodiment, the lesion prediction control is displayed only when the evaluation score is smaller than the evaluation score threshold, which avoids cluttering the user interface with unnecessary elements that might interfere with the user.
The method provided by this embodiment also predicts the lesion prediction result according to the difference between the historical pronunciation data and the current pronunciation data, providing a way to predict the lesion prediction result accurately.
The method provided by this embodiment also obtains the lesion prediction result through a classification model, an artificial-intelligence-based approach that combines the user's past pronunciation data to predict the lesion prediction result accurately. This process requires no manual operation, so the prediction efficiency is high.
The method provided by this embodiment also reminds the user of the correct sounding manner by displaying the sounding animation, which helps the user learn to pronounce correctly.
According to the method provided by this embodiment, the historical evaluation scores of the user are displayed, so that the user can intuitively understand the past pronunciation quality and how it has changed, which improves the user experience.
2. For the case of pronunciation assessment from pronunciation related data:
fig. 14 is a flow chart of a pronunciation assessment method for assisting in diagnosis of a sounding lesion according to an exemplary embodiment of the present application. The method may be used for a terminal or a client on a terminal in a system as shown in fig. 3. As shown in fig. 14, the method includes:
step 1402: a pronunciation assessment interface for assisting in the diagnosis of a sounding lesion is displayed.
The pronunciation assessment interface is a user interface in the client for triggering pronunciation assessment. And displaying visual prompt information of the voice to be evaluated in the pronunciation evaluation interface.
Step 1404: and responding to the recording operation, and acquiring pronunciation related data of the user account aiming at the voice to be evaluated.
While recording the pronunciation data of the user account for the voice to be evaluated, the client also acquires a face image of the user account and collects sounding airflow data of the user account for the voice to be evaluated, and jointly determines the pronunciation data, the face image, and the sounding airflow data as the pronunciation related data of the user account for the voice to be evaluated, thereby obtaining the pronunciation related data.
Since different users have similar facial features (mainly near the mouth) when pronouncing the same speech to be evaluated, the face image can be used for pronunciation assessment. In addition, since the airflow expelled from the lungs through the mouth has similar characteristics when different users pronounce the same speech to be evaluated, the sounding airflow data can also be used for pronunciation assessment. Optionally, the client acquires the face image through a camera of the terminal; the face image is an image of the face captured during pronunciation of the voice to be evaluated, and one or more such images may be captured. Because a microphone is structured to respond to airflow, the client can collect the sounding airflow data, i.e., the airflow data collected during pronunciation of the voice to be evaluated, through a microphone of the terminal. Optionally, after the client collects the sounding airflow data, the portions corresponding to periods in which no sound was recorded are removed from the sounding airflow data according to the time correspondence between the sounding airflow data and the pronunciation data.
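A minimal sketch of removing the silent portions from the sounding airflow data via the time correspondence with the pronunciation data; the frame-energy threshold, the frame length, and the assumption that both signals share one timeline are illustrative:

```python
import numpy as np

def trim_airflow_to_voiced(pronunciation, airflow, frame_len=480, thresh=1e-4):
    """Keep only the airflow samples whose corresponding pronunciation frames
    contain recorded sound (simple frame-energy voice-activity detection)."""
    pron = np.asarray(pronunciation, dtype=float)
    air = np.asarray(airflow, dtype=float)
    n = min(len(pron), len(air))                    # shared timeline length
    pron, air = pron[:n], air[:n]
    keep = np.zeros(n, dtype=bool)
    for start in range(0, n, frame_len):
        frame = pron[start:start + frame_len]
        if np.mean(frame ** 2) > thresh:            # sound recorded in frame
            keep[start:start + frame_len] = True
    return air[keep]
```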
Step 1406: and determining an evaluation score according to the similarity degree of the pronunciation related data and the healthy pronunciation related data corresponding to the voice to be evaluated.
The evaluation score is used for evaluating the health degree of the pronunciation-related data compared with the health pronunciation-related data, wherein the health pronunciation-related data is pronunciation-related data of a normal person aiming at the voice to be evaluated. The evaluation score is positively correlated with the degree of similarity. The healthy pronunciation related data includes healthy pronunciation data, healthy face images, and healthy pronunciation airflow data. The healthy pronunciation data is pronunciation data collected when a normal person pronounces the speech to be evaluated. The healthy face image is a face image collected when a normal person pronounces the voice to be evaluated. Healthy sounding airflow data is sounding airflow data collected when a normal person pronounces a voice to be evaluated.
Optionally, the client extracts a formant feature vector from the pronunciation data, a face feature vector from the face image, and a sounding airflow feature vector from the sounding airflow data. The formant feature vector, the face feature vector, and the sounding airflow feature vector are fused to obtain a fused feature vector, which is then input into a scoring model to obtain the evaluation score.
The scoring model is trained on the difference between the prediction score and the true score of the healthy fused feature vector corresponding to the healthy pronunciation related data. The prediction score is obtained by the scoring model predicting on the healthy fused feature vector, and the true score is obtained by manually labeling the healthy fused feature vector. The healthy fused feature vector is obtained by extracting the feature vectors of the healthy pronunciation data, the healthy face image, and the healthy sounding airflow data and then fusing the feature vectors of the three modalities. The scoring model is implemented based on a neural network (NN). Optionally, in order to further improve the accuracy of the scoring model, different scoring models can also be trained for different voices to be evaluated; in this case, only the data corresponding to the voice to be evaluated of a given scoring model is input during its training and use.
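A minimal sketch of the feature fusion and scoring model, using plain concatenation as the fusion strategy and scikit-learn's MLPRegressor as a stand-in for the neural-network scoring model; the single-modality feature extraction is assumed to be done elsewhere and the function names are illustrative:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def fuse_features(formant_vec, face_vec, airflow_vec):
    """Fuse the three single-modality feature vectors into one fused vector
    (plain concatenation is used here as an illustrative fusion strategy)."""
    return np.concatenate([formant_vec, face_vec, airflow_vec])

def train_scoring_model(healthy_fused_vectors, manual_true_scores):
    """Train the scoring model on healthy fused feature vectors and their
    manually labeled true scores."""
    model = MLPRegressor(hidden_layer_sizes=(128, 64), max_iter=1000,
                         random_state=0)
    model.fit(np.asarray(healthy_fused_vectors), np.asarray(manual_true_scores))
    return model

def evaluate_recording(model, formant_vec, face_vec, airflow_vec):
    """Predict the evaluation score of one recording from its fused features."""
    fused = fuse_features(formant_vec, face_vec, airflow_vec).reshape(1, -1)
    return float(model.predict(fused)[0])
```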
Step 1408: and displaying an evaluation result interface for sounding lesion reference, wherein an evaluation score of sounding quality is displayed in the evaluation result interface.
The evaluation score is used to evaluate the health of the pronunciation-related data as compared to the health pronunciation-related data. In the case that a plurality of voices to be evaluated exist, the evaluation score is determined according to the average value or weighted average value of scores corresponding to different voices to be evaluated.
Step 1410: and displaying a lesion prediction control on the evaluation result interface.
The lesion prediction control is used for triggering the client to predict whether sounding lesions appear according to sounding data of the user account. Optionally, the client may display an evaluation score threshold in the evaluation result interface, and in the case where the evaluation score is less than the evaluation score threshold, the client displays a lesion prediction control in the evaluation result interface.
Step 1412: and responding to the lesion prediction operation triggered on the lesion prediction control, and displaying a lesion prediction result on the evaluation result interface.
The lesion prediction result is used to predict whether a sounding lesion is present, and is used only as a reference, not as a result of medical diagnosis.
It should be noted that the above steps can also be performed by the server in combination with the client. For example, the client terminal collects the relevant data of the user account for pronunciation assessment and then sends the relevant data to the server. The server determines the evaluation score and/or the lesion prediction result of the user account according to the related data sent by the client through the method, and sends the evaluation score and/or the lesion prediction result to the client. The client may then display the assessment score and/or lesion prediction results on an assessment results interface.
In summary, according to the method provided by this embodiment, the pronunciation related data of the user for the voice to be evaluated is collected by displaying the pronunciation assessment interface, and the evaluation score is then displayed on the evaluation result interface. The evaluation score reflects the pronunciation quality of the user, so that pronunciation assessment can be performed for the user and the diagnosis of sounding lesions can thereby be assisted. This process requires neither medical equipment nor manual judgment, which improves the efficiency of pronunciation assessment.
According to the method provided by this embodiment, the pronunciation assessment is performed using the pronunciation data, the face image, and the sounding airflow data; adding data of more dimensions to the pronunciation assessment can improve its accuracy.
According to the method provided by this embodiment, the evaluation score is determined by extracting the feature vectors of the pronunciation related data and performing multimodal feature fusion on them; the evaluation score is then predicted accurately from the fused feature vector by the scoring model, an artificial-intelligence-based approach. This process requires no manual operation, so the efficiency of pronunciation assessment can be further improved.
It should be noted that, the information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data for analysis, stored data, presented data, etc.), and signals related to the present application are all authorized by the user or are fully authorized by the parties, and the collection, use, and processing of the related data is required to comply with the relevant laws and regulations and standards of the relevant countries and regions. For example, pronunciation data and history evaluation score, etc., involved in the present application are obtained with sufficient authorization.
It should be noted that, the sequence of the steps of the method provided in the embodiment of the present application may be appropriately adjusted, the steps may also be increased or decreased according to the situation, and any method that is easily conceivable to be changed by those skilled in the art within the technical scope of the present disclosure should be covered within the protection scope of the present disclosure, so that no further description is given.
Fig. 15 is a schematic structural view of a pronunciation assessment device for assisting diagnosis of a sounding lesion according to an exemplary embodiment of the present application. The apparatus may be used for a terminal in a system as shown in fig. 1. As shown in fig. 15, the apparatus includes:
the display module 1501 is configured to display a pronunciation assessment interface for assisting in diagnosis of a sounding lesion, where visual cue information of a voice to be assessed is displayed.
The recording module 1502 is configured to record pronunciation data of the voice to be evaluated for the user account in response to the recording operation.
The display module 1501 is further configured to display an evaluation result interface for sounding lesion reference, where an evaluation score of sounding quality is displayed, where the evaluation score is used to evaluate a health degree of sounding data compared with health sounding data, the health sounding data being sounding data of a normal person for a voice to be evaluated.
In an alternative design, the speech to be evaluated includes vowels to be evaluated that correspond to formant characteristics.
As shown in fig. 16, the apparatus further includes:
a determining module 1503 is configured to determine a formant characteristic to be evaluated of the pronunciation data.
The determining module 1503 is further configured to determine an evaluation score according to a similarity degree between the formant feature to be evaluated and the healthy formant feature of the healthy pronunciation data.
The display module 1501 is configured to display an evaluation result interface and display an evaluation score on the evaluation result interface.
In an alternative design, the determining module 1503 is configured to:
formants for each audio frame in the voicing data are determined. A formant histogram is generated from formants for each audio frame. The formants of each audio frame are adjusted according to the peaks in the formant histogram. And determining the average value of the formants after adjustment of each audio frame as the formant characteristics to be evaluated of the pronunciation data.
In an alternative design, the formant characteristics to be evaluated include formant 1 to be evaluated and formant 2 to be evaluated, and the healthy formant characteristics include healthy formant 1 and healthy formant 2. A determining module 1503, configured to:
in a coordinate system with a first coordinate axis being a formant 1 and a second coordinate axis being a formant 2, fitting an ellipse according to the position points of the healthy formants 1 and 2 of the plurality of healthy pronunciation data. And determining the position points of the characteristics of the formants to be evaluated in the coordinate system according to the formants to be evaluated 1 and the formants to be evaluated 2. The distance between the location point of the formant feature to be evaluated and the center of the ellipse is determined. An evaluation score is determined from the distances, the distances being inversely related to the evaluation score.
In an alternative design, the formant characteristics to be evaluated include formant 1 to be evaluated and formant 2 to be evaluated, and the healthy formant characteristics include healthy formant 1 and healthy formant 2. A determining module 1503, configured to:
according to the formant 1 to be evaluated and the formant 2 to be evaluated, determining the probability density function value of the formant characteristics to be evaluated through a Gaussian mixture model, where the parameters of the Gaussian mixture model are learned through an expectation-maximization algorithm according to the healthy formants 1 and healthy formants 2 of a plurality of pieces of healthy pronunciation data. And determining the limiting maximum probability density function value of the Gaussian mixture model according to the sum of the products of the maximum probability density function values of the Gaussian mixture model under different Gaussian components and the weights of those Gaussian components. And determining an evaluation score according to the distance between the probability density function value of the formant characteristics to be evaluated and the limiting maximum probability density function value, the distance being inversely related to the evaluation score.
In an alternative design, as shown in fig. 17, the apparatus further comprises:
the obtaining module 1504 is configured to obtain a face image of the user account in response to the recording operation, and obtain sounding airflow data of the user account for the voice to be evaluated.
The determining module 1503 is configured to determine pronunciation data, a face image, and sounding airflow data together as pronunciation related data of the user account for the voice to be evaluated.
The display module 1501 is configured to display an evaluation result interface and display an evaluation score in the evaluation result interface, where the evaluation score is used to evaluate the health degree of the pronunciation related data compared with the health pronunciation related data, and the health pronunciation related data is pronunciation related data of a normal person for a voice to be evaluated.
In an alternative design, as shown in fig. 18, the apparatus further comprises:
the extracting module 1505 is configured to extract formant feature vectors of the pronunciation data. And extracting the face feature vector of the face image. And extracting the sounding airflow characteristic vector of the sounding airflow data.
The fusion module 1506 is configured to fuse the formant feature vector, the face feature vector, and the feature vector of the three modes of the sounding airflow feature vector to obtain a fused feature vector.
The determining module 1503 is configured to input the fusion feature vector into a scoring model to obtain an evaluation score, where the scoring model is obtained by training a difference between a prediction score of the health fusion feature vector corresponding to the health pronunciation related data and a true score, the prediction score is obtained by predicting the health fusion feature vector through the scoring model, and the true score is obtained by manually labeling the health fusion feature vector.
The display module 1501 is configured to display an evaluation result interface and display an evaluation score on the evaluation result interface.
In an alternative design, display module 1501 is used to:
and displaying a lesion prediction control on the evaluation result interface. And responding to the lesion prediction operation triggered on the lesion prediction control, and displaying a lesion prediction result on an evaluation result interface, wherein the lesion prediction result is used for predicting whether sounding lesions appear.
In an alternative design, display module 1501 is used to:
and displaying an evaluation score threshold in an evaluation result interface. And displaying a lesion prediction control on the evaluation result interface under the condition that the evaluation score is smaller than the evaluation score threshold value.
In an alternative design, as shown in fig. 19, the apparatus further comprises:
and the obtaining module 1504 is configured to obtain historical pronunciation data of the user account for the voice to be evaluated in response to the lesion prediction operation.
The prediction module 1507 is configured to predict a lesion prediction result according to a difference between the historical pronunciation data and the pronunciation data.
And the display module 1501 is used for displaying the lesion prediction result on the evaluation result interface.
In an alternative design, the prediction module 1507 is configured to:
and inputting the characteristics of the historical pronunciation data and the characteristics of the pronunciation data into a classification model to obtain a lesion prediction result. The classification model is obtained through error training between a prediction label and a real label, the prediction label is obtained by predicting whether a sample user has sounding lesions according to the characteristics of first sounding data and the characteristics of second sounding data, the first sounding data is sounding data of the sample user in a first period, the second sounding data is sounding data of the sample user in a second period, the real label is used for reflecting whether the sample user has sounding lesions between the first period and the second period after the first period.
In an alternative design, display module 1501 is used to:
and displaying a sounding animation on the sounding evaluation interface, wherein the sounding animation is used for reflecting the dynamic state of the process that the human sounding organ emits healthy sound corresponding to the voice to be evaluated.
In an alternative design, display module 1501 is used to:
and responding to the lesion prediction operation, and displaying a historical evaluation score of the user account for the voice to be evaluated on an evaluation result interface.
It should be noted that: the pronunciation assessment device for assisting diagnosis of a sounding lesion provided in the above embodiment is only exemplified by the division of the above functional modules, and in practical application, the above functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the pronunciation assessment device for assisting in diagnosis of a sounding lesion provided in the above embodiment and the pronunciation assessment method embodiment for assisting in diagnosis of a sounding lesion belong to the same concept, and detailed implementation processes thereof are shown in the method embodiment and are not repeated here.
Embodiments of the present application also provide a computer device, including a processor and a memory, where at least one instruction, at least one program, a code set, or an instruction set is stored in the memory and is loaded and executed by the processor to implement the pronunciation assessment method for assisting in the diagnosis of a sounding lesion provided by the above method embodiments.
Optionally, the computer device is a terminal. Fig. 20 is a schematic structural view of a terminal according to an exemplary embodiment of the present application.
In general, the terminal 2000 includes: a processor 2001 and a memory 2002.
Processor 2001 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so forth. The processor 2001 may be implemented in at least one hardware form of DSP (Digital Signal Processing ), FPGA (Field-Programmable Gate Array, field programmable gate array), PLA (Programmable Logic Array ). Processor 2001 may also include a main processor, which is a processor for processing data in an awake state, also called a CPU (Central Processing Unit ), and a coprocessor; a coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 2001 may integrate a GPU (Graphics Processing Unit, image processor) for rendering and drawing of content required to be displayed by the display screen. In some embodiments, the processor 2001 may also include an AI (Artificial Intelligence ) processor for processing computing operations related to machine learning.
Memory 2002 may include one or more computer-readable storage media, which may be non-transitory. Memory 2002 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 2002 is used to store at least one instruction for execution by processor 2001 to implement the pronunciation assessment method for assisting in the diagnosis of a sounding lesion provided by a method embodiment of the present application.
In some embodiments, the terminal 2000 may further optionally include: a peripheral interface 2003 and at least one peripheral. The processor 2001, memory 2002, and peripheral interface 2003 may be connected by a bus or signal line. The respective peripheral devices may be connected to the peripheral device interface 2003 through a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 2004, a display 2005, a camera assembly 2006, audio circuitry 2007, and a power supply 2008.
Peripheral interface 2003 may be used to connect I/O (Input/Output) related at least one peripheral device to processor 2001 and memory 2002. In some embodiments, processor 2001, memory 2002, and peripheral interface 2003 are integrated on the same chip or circuit board; in some other embodiments, either or both of processor 2001, memory 2002, and peripheral interface 2003 may be implemented on separate chips or circuit boards, as embodiments of the application are not limited in this regard.
The Radio Frequency circuit 2004 is used to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuit 2004 communicates with a communication network and other communication devices via electromagnetic signals. The radio frequency circuit 2004 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 2004 includes: antenna systems, RF transceivers, one or more amplifiers, tuners, oscillators, digital signal processors, codec chipsets, subscriber identity module cards, and so forth. The radio frequency circuitry 2004 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocol includes, but is not limited to: the world wide web, metropolitan area networks, intranets, generation mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity ) networks. In some embodiments, the radio frequency circuitry 2004 may also include NFC (Near Field Communication, short range wireless communication) related circuitry, which is not limiting of the application.
The display 2005 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display 2005 is a touch display, the display 2005 also has the ability to capture touch signals at or above the surface of the display 2005. The touch signal may be input to the processor 2001 as a control signal for processing. At this point, the display 2005 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display 2005, providing a front panel of the terminal 2000; in other embodiments, there may be at least two displays 2005, respectively disposed on different surfaces of the terminal 2000 or in a folded design; in still other embodiments, the display 2005 may be a flexible display disposed on a curved surface or a folded surface of the terminal 2000. The display 2005 may even be arranged in an irregular, non-rectangular pattern, i.e., a shaped screen. The display 2005 can be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), or other materials.
The camera assembly 2006 is used to capture images or video. Optionally, the camera assembly 2006 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of the terminal 2000 and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that a background blurring function is realized by fusing the main camera and the depth-of-field camera, and panoramic shooting and virtual reality (VR) shooting functions or other fusion shooting functions are realized by fusing the main camera and the wide-angle camera. In some embodiments, the camera assembly 2006 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash refers to a combination of a warm-light flash and a cold-light flash and can be used for light compensation under different color temperatures.
Audio circuitry 2007 may include a microphone and a speaker. The microphone is used for collecting sound waves of users and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 2001 for processing, or inputting the electric signals to the radio frequency circuit 2004 for voice communication. For purposes of stereo acquisition or noise reduction, a plurality of microphones may be respectively disposed at different portions of the terminal 2000. The microphone may also be an array microphone or an omni-directional pickup microphone. The speaker is then used to convert electrical signals from the processor 2001 or the radio frequency circuit 2004 into sound waves. The speaker may be a conventional thin film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, not only the electric signal can be converted into a sound wave audible to humans, but also the electric signal can be converted into a sound wave inaudible to humans for ranging and other purposes. In some embodiments, audio circuit 2007 may also include a headphone jack.
Power supply 2008 is used to power the various components in terminal 2000. The power source 2008 may be alternating current, direct current, disposable battery, or rechargeable battery. When power supply 2008 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. The wired rechargeable battery is a battery charged through a wired line, and the wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, terminal 2000 can further include one or more sensors 2009. The one or more sensors 2009 include, but are not limited to: acceleration sensor 2010, gyro sensor 2011, pressure sensor 2012, optical sensor 2013, and proximity sensor 2014.
The acceleration sensor 2010 may detect the magnitudes of accelerations on three coordinate axes of a coordinate system established with the terminal 2000. For example, the acceleration sensor 2010 may be used to detect components of gravitational acceleration on the three coordinate axes. The processor 2001 may control the touch display 2005 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal acquired by the acceleration sensor 2010. The acceleration sensor 2010 may also be used for collecting motion data of a game or a user.
The gyro sensor 2011 may detect a body direction and a rotation angle of the terminal 2000, and the gyro sensor 2011 may collect a 3D motion of the user to the terminal 2000 in cooperation with the acceleration sensor 2010. The processor 2001 may implement the following functions based on the data collected by the gyro sensor 2011: motion sensing (e.g., changing UI according to a tilting operation by a user), image stabilization at shooting, game control, and inertial navigation.
Pressure sensor 2012 may be disposed at a side frame of terminal 2000 and/or an underlying layer of touch display 2005. When the pressure sensor 2012 is disposed at a side frame of the terminal 2000, a grip signal of the terminal 2000 by a user may be detected, and the processor 2001 performs left-right hand recognition or quick operation according to the grip signal collected by the pressure sensor 2012. When the pressure sensor 2012 is disposed below the touch display 2005, control of the operability control on the UI interface is achieved by the processor 2001 according to a user's pressure operation on the touch display 2005. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
The optical sensor 2013 is used to collect the ambient light intensity. In one embodiment, the processor 2001 may control the display brightness of the touch display 2005 based on the ambient light intensity collected by the optical sensor 2013. Specifically, when the intensity of the ambient light is high, the display brightness of the touch display 2005 is turned up; when the ambient light intensity is low, the display brightness of the touch display 2005 is turned down. In another embodiment, the processor 2001 may also dynamically adjust the shooting parameters of the camera assembly 2006 based on the ambient light intensity collected by the optical sensor 2013.
A proximity sensor 2014, also referred to as a distance sensor, is typically provided at the front panel of the terminal 2000. The proximity sensor 2014 is used to collect a distance between a user and the front surface of the terminal 2000. In one embodiment, when the proximity sensor 2014 detects that the distance between the user and the front surface of the terminal 2000 becomes gradually smaller, the processor 2001 controls the touch display 2005 to switch from the bright screen state to the off screen state; when the proximity sensor 2014 detects that the distance between the user and the front surface of the terminal 2000 gradually increases, the processor 2001 controls the touch display 2005 to switch from the off-screen state to the on-screen state.
It will be appreciated by those skilled in the art that the structure shown in fig. 20 is not limiting and that more or fewer components than shown may be included or certain components may be combined or a different arrangement of components may be employed.
The embodiment of the application also provides a computer readable storage medium, at least one program code is stored in the computer readable storage medium, and when the program code is loaded and executed by a processor of a computer device, the pronunciation assessment method for assisting the diagnosis of the sounding lesion provided by the embodiment of the method is realized.
The present application also provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the pronunciation assessment method for assisting in the diagnosis of a sounding lesion provided by the above-described method embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the above readable storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description of the preferred embodiments of the application is not intended to limit the application to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and principles of the application are intended to be included within the scope of the application.

Claims (17)

1. A pronunciation assessment method for assisting in the diagnosis of a sounding lesion, the method comprising:
Displaying a pronunciation assessment interface for assisting in diagnosis of a pronunciation lesion, wherein visual prompt information of a voice to be assessed is displayed in the pronunciation assessment interface;
recording pronunciation data of a user account aiming at the voice to be evaluated in response to recording operation;
and displaying an evaluation result interface for sounding lesion reference, wherein an evaluation score of sounding quality is displayed in the evaluation result interface and is used for evaluating the health degree of the sounding data compared with healthy sounding data, and the healthy sounding data is sounding data of a normal person aiming at the voice to be evaluated.
2. The method of claim 1, wherein the speech to be evaluated comprises vowels to be evaluated that correspond to formant characteristics; and the displaying an evaluation result interface for sounding lesion reference, wherein an evaluation score of sounding quality is displayed in the evaluation result interface, comprises the following steps:
determining formant characteristics to be evaluated of the pronunciation data;
determining the evaluation score according to the similarity degree between the formant characteristics to be evaluated and the health formant characteristics of the health pronunciation data;
displaying the evaluation result interface and displaying the evaluation score on the evaluation result interface.
3. The method of claim 2, wherein said determining formant characteristics to be evaluated of the pronunciation data comprises:
determining formants for each audio frame in the voicing data;
generating a formant histogram according to formants of each audio frame;
adjusting formants of each audio frame according to peaks in the formant histogram;
and determining the average value of the formants after the adjustment of each audio frame as the formant characteristics to be evaluated of the pronunciation data.
4. The method of claim 2, wherein the formant characteristics to be evaluated include formant 1 to be evaluated and formant 2 to be evaluated, the healthy formant characteristics including healthy formant 1 and healthy formant 2;
the determining the evaluation score according to the similarity degree between the formant characteristics to be evaluated and the health formant characteristics of the health pronunciation data comprises the following steps:
fitting ellipses according to the position points of the healthy formants 1 and 2 of the healthy pronunciation data in a coordinate system with a first coordinate axis being a formant 1 and a second coordinate axis being a formant 2;
determining the position points of the formant characteristics to be evaluated in the coordinate system according to the formants to be evaluated 1 and the formants to be evaluated 2;
Determining a distance between a location point of the formant feature to be evaluated and a center of the ellipse;
and determining the evaluation score according to the distance, wherein the distance is inversely related to the evaluation score.
5. The method of claim 2, wherein the formant characteristics to be evaluated include formant 1 to be evaluated and formant 2 to be evaluated, the healthy formant characteristics including healthy formant 1 and healthy formant 2;
the determining the evaluation score according to the similarity degree between the formant characteristics to be evaluated and the health formant characteristics of the health pronunciation data comprises the following steps:
according to the formants 1 and 2 to be evaluated, determining probability density function values of the formant characteristics to be evaluated through a Gaussian mixture model, wherein parameters of the Gaussian mixture model are obtained through learning by means of an expectation-maximization algorithm according to healthy formants 1 and 2 of the healthy pronunciation data;
determining a limit maximum probability density function value of the Gaussian mixture model according to the sum of products of maximum probability density function values of the Gaussian mixture model under different Gaussian components and weights of the Gaussian components;
And determining the evaluation score according to the distance between the probability density function value of the formant characteristic to be evaluated and the limit maximum probability density function value, wherein the distance is inversely related to the evaluation score.
6. The method according to any one of claims 1 to 5, further comprising:
responding to the recording operation, acquiring a face image of the user account, and acquiring sounding airflow data of the user account for the voice to be evaluated;
jointly determining the pronunciation data, the face image and the sounding airflow data as pronunciation related data of the user account for the voice to be evaluated;
wherein the displaying an evaluation result interface for sounding lesion reference, wherein an evaluation score of sounding quality is displayed in the evaluation result interface, comprises the following steps:
displaying the evaluation result interface and displaying the evaluation score in the evaluation result interface, wherein the evaluation score is used for evaluating the health degree of the pronunciation related data compared with health pronunciation related data, and the health pronunciation related data is pronunciation related data of a normal person aiming at the voice to be evaluated.
7. The method of claim 6, wherein the displaying the assessment results interface and displaying the assessment score in the assessment results interface comprises:
extracting formant feature vectors of the pronunciation data; extracting a face feature vector of the face image; extracting sound-producing airflow characteristic vectors of the sound-producing airflow data;
fusing the formant feature vector, the face feature vector and the feature vector of the sounding airflow feature vector to obtain a fused feature vector;
inputting the fusion feature vector into a scoring model to obtain the evaluation score, wherein the scoring model is obtained through difference training between a prediction score and a true score of a health fusion feature vector corresponding to the health pronunciation related data, the prediction score is obtained through prediction of the health fusion feature vector by the scoring model, and the true score is obtained through manual labeling of the health fusion feature vector;
displaying the evaluation result interface and displaying the evaluation score on the evaluation result interface.
8. The method according to any one of claims 1 to 5, further comprising:
Displaying a lesion prediction control on the evaluation result interface;
and responding to the lesion prediction operation triggered on the lesion prediction control, and displaying a lesion prediction result on the evaluation result interface, wherein the lesion prediction result is used for predicting whether sounding lesions appear.
9. The method of claim 8, wherein displaying a lesion prediction control at the assessment results interface comprises:
displaying an assessment score threshold in the assessment results interface;
and displaying the lesion prediction control on the evaluation result interface in the case that the evaluation score is smaller than the evaluation score threshold value.
10. The method of claim 8, wherein the displaying a lesion prediction result on the evaluation result interface in response to the lesion prediction operation triggered on the lesion prediction control comprises:
in response to the lesion prediction operation, acquiring historical pronunciation data of the user account for the voice to be evaluated;
predicting the lesion prediction result according to the difference between the historical pronunciation data and the pronunciation data;
and displaying the lesion prediction result on the evaluation result interface.
11. The method of claim 10, wherein the predicting the lesion prediction result according to the difference between the historical pronunciation data and the pronunciation data comprises:
inputting features of the historical pronunciation data and features of the pronunciation data into a classification model to obtain the lesion prediction result;
wherein the classification model is trained on the error between a prediction label and a real label, the prediction label is obtained by predicting, according to features of first pronunciation data and features of second pronunciation data, whether a sample user has a sounding lesion, the first pronunciation data is pronunciation data of the sample user in a first period, the second pronunciation data is pronunciation data of the sample user in a second period after the first period, and the real label reflects whether a sounding lesion occurred in the sample user between the first period and the second period.
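A minimal PyTorch sketch of this two-period classification model: features from the first and second recording periods are concatenated and classified into lesion / no-lesion, and the model is trained on the cross-entropy error between its prediction label and the real label. The feature size, network shape and choice of cross-entropy are illustrative assumptions, not details taken from the patent.

```python
import torch
import torch.nn as nn

class LesionClassifier(nn.Module):
    """Predicts whether a sounding lesion appeared between two recording periods,
    given pronunciation features from both periods (architecture is illustrative)."""
    def __init__(self, feat_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * feat_dim, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, first_feats, second_feats):
        return self.net(torch.cat([first_feats, second_feats], dim=-1))

feat_dim = 16                                  # assumed feature size
first_feats  = torch.randn(32, feat_dim)       # sample users, first period
second_feats = torch.randn(32, feat_dim)       # same users, second (later) period
real_labels  = torch.randint(0, 2, (32,))      # whether a lesion appeared between the periods

model = LesionClassifier(feat_dim)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step on the error between the prediction label and the real label.
loss = nn.functional.cross_entropy(model(first_feats, second_feats), real_labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```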
12. The method according to any one of claims 1 to 5, further comprising:
and displaying a sounding animation on the pronunciation assessment interface, wherein the sounding animation is used for reflecting the dynamics of the process in which the human vocal organs produce the healthy sound corresponding to the voice to be evaluated.
13. The method of claim 8, wherein the method further comprises:
in response to the lesion prediction operation, displaying, on the evaluation result interface, a historical evaluation score of the user account for the voice to be evaluated.
14. A pronunciation assessment device for assisting in the diagnosis of a sounding lesion, the device comprising:
a display module configured to display a pronunciation assessment interface for assisting in the diagnosis of sounding lesions, wherein visual prompt information of a voice to be evaluated is displayed in the pronunciation assessment interface;
a recording module configured to record, in response to a recording operation, pronunciation data of a user account for the voice to be evaluated;
wherein the display module is further configured to display an evaluation result interface for sounding lesion reference, the evaluation result interface displaying an evaluation score of sounding quality, the evaluation score being used to evaluate the health degree of the pronunciation data compared with healthy pronunciation data, and the healthy pronunciation data being pronunciation data of a normal person for the voice to be evaluated.
15. A computer device, comprising a processor and a memory, wherein the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the pronunciation assessment method for assisting in the diagnosis of a sounding lesion according to any one of claims 1 to 13.
16. A computer-readable storage medium, wherein the storage medium stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the pronunciation assessment method for assisting in the diagnosis of a sounding lesion according to any one of claims 1 to 13.
17. A computer program product, comprising computer instructions stored in a computer-readable storage medium, wherein a processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the pronunciation assessment method for assisting in the diagnosis of a sounding lesion according to any one of claims 1 to 13.
CN202210189721.6A 2022-02-28 2022-02-28 Pronunciation assessment method, device and equipment for assisting diagnosis of sounding lesions Pending CN116687343A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210189721.6A CN116687343A (en) 2022-02-28 2022-02-28 Pronunciation assessment method, device and equipment for assisting diagnosis of sounding lesions

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210189721.6A CN116687343A (en) 2022-02-28 2022-02-28 Pronunciation assessment method, device and equipment for assisting diagnosis of sounding lesions

Publications (1)

Publication Number Publication Date
CN116687343A true CN116687343A (en) 2023-09-05

Family

ID=87832698

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210189721.6A Pending CN116687343A (en) 2022-02-28 2022-02-28 Pronunciation assessment method, device and equipment for assisting diagnosis of sounding lesions

Country Status (1)

Country Link
CN (1) CN116687343A (en)

Similar Documents

Publication Publication Date Title
EP4006901A1 (en) Audio signal processing method and apparatus, electronic device, and storage medium
CN110379430B (en) Animation display method and device based on voice, computer equipment and storage medium
CN105654952B (en) Electronic device, server and method for outputting voice
US20220172737A1 (en) Speech signal processing method and speech separation method
WO2020224479A1 (en) Method and apparatus for acquiring positions of target, and computer device and storage medium
CN110322760B (en) Voice data generation method, device, terminal and storage medium
EP3373301A1 (en) Apparatus, robot, method and recording medium having program recorded thereon
CN111564152B (en) Voice conversion method and device, electronic equipment and storage medium
CN111524501B (en) Voice playing method, device, computer equipment and computer readable storage medium
CN110992927B (en) Audio generation method, device, computer readable storage medium and computing equipment
US20240296657A1 (en) Video classification method and apparatus
CN111359209B (en) Video playing method and device and terminal
CN111683329B (en) Microphone detection method, device, terminal and storage medium
CN111048109A (en) Acoustic feature determination method and apparatus, computer device, and storage medium
CN111428079B (en) Text content processing method, device, computer equipment and storage medium
CN114299935A (en) Awakening word recognition method, awakening word recognition device, terminal and storage medium
CN113362836A (en) Vocoder training method, terminal and storage medium
CN113205569A (en) Image drawing method and device, computer readable medium and electronic device
CN111652624A (en) Ticket buying processing method, ticket checking processing method, device, equipment and storage medium
KR20210100831A (en) System and method for providing sign language translation service based on artificial intelligence
CN116956814A (en) Punctuation prediction method, punctuation prediction device, punctuation prediction equipment and storage medium
CN116687343A (en) Pronunciation assessment method, device and equipment for assisting diagnosis of sounding lesions
CN115394285A (en) Voice cloning method, device, equipment and storage medium
WO2021147417A1 (en) Voice recognition method and apparatus, computer device, and computer-readable storage medium
CN111028823B (en) Audio generation method, device, computer readable storage medium and computing equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40094530

Country of ref document: HK