ISCA - International Speech Communication Association


  • 2024-01-04 15:21 | Anonymous

    The SAMoVA team at IRIT in Toulouse is offering several internships (M1, M2, final-year engineering projects) in 2024 on the following topics (non-exhaustive list):

     

    - Automatic Generation of Musical Scores in the Choro Style

    - Spoken Language Understanding and AI for Sensory Analysis

    - Characterisation of Eating Behaviour through Video and Multimodal Analysis

    - Adaptation of Automatic Speech Recognition Systems to Pathological Contexts

    - Signal Processing and AI to Reveal Articulatory Disorders in Atypical Speech Production

    - End-to-End Speech Recognition for Assessing Comprehension Skills of Children Learning to Read

    - Active Learning for Speaker Diarization

    - Automatic Modelling of Speech Rhythm

    - Transcription of Verbalisations for Discourse Analysis in Virtual Reality Scenarios

    - Implementation of a Comparative Speech Recognition Prototype Applied to Oral Language Learning

     

    Full details (topics, contacts) are available in the team's 'Jobs' section:
    https://www.irit.fr/SAMOVA/site/jobs/
  • 2024-01-04 15:20 | Anonymous

    Post-doc offer – Linguistics / Computational Linguistics

     

    Duration: 9 months

    Start: January or February 2024; a March 2024 start is negotiable

    Location: LIUM – Le Mans Université

    Net salary: approximately €2,000/month, depending on skills

    Contact: jane.wottawa@univ-lemans.fr, richard.dufour@univ-nantes.fr

    Application: cover letter, CV (3 pages maximum)


    Within the framework of the DIETS project, which focuses on evaluation metrics for automatic speech recognition systems, a post-doc position is planned to:

    a) Carry out a linguistic and grammatical analysis of the errors in the output of automatic speech recognition systems

    b) Run human evaluation tests for different types of errors

    c) Compare the outcomes of these human evaluations with the assessments produced by automatic metrics

    d) Publish the results (conferences, journals)

     

     

    The DIETS project

     

    One of the major problems with evaluation measures in language processing is that they are designed to score a proposed output globally against a chosen reference, the main objective being to compare systems with one another. The choice of evaluation measures is often crucial, since the research undertaken to improve these systems is driven by those measures. Although automatic systems such as speech transcription are aimed at end users, the users themselves have received little attention: the impact of automatic errors on humans, and the way these errors are perceived at the cognitive level, have not been studied, let alone integrated into the evaluation process.
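    As a concrete illustration of the kind of global, reference-based metric discussed above, the sketch below computes word error rate (WER) between a reference transcript and an ASR hypothesis. It is only an illustrative example written for this posting, not code from the DIETS project; such a score treats every error identically, which is precisely the limitation the project sets out to address.

    # Minimal WER sketch: word-level edit distance divided by reference length.
    def wer(reference: str, hypothesis: str) -> float:
        ref, hyp = reference.split(), hypothesis.split()
        # Dynamic-programming edit distance (substitutions, deletions, insertions).
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,           # deletion
                              d[i][j - 1] + 1,           # insertion
                              d[i - 1][j - 1] + cost)    # substitution or match
        return d[len(ref)][len(hyp)] / max(len(ref), 1)

    print(wer("the cat sat on the mat", "the cat sit on mat"))  # ~0.33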

     

    The DIETS project, funded by the Agence Nationale de la Recherche (2021-2024) and led by the Laboratoire Informatique d'Avignon, focuses on the diagnosis/evaluation of end-to-end automatic speech recognition (ASR) systems based on deep neural network architectures, by integrating the human reception of transcription errors from a cognitive point of view. The challenge here is twofold:

     

        1) Finely analyse ASR errors on the basis of their human reception.

     

        2) Understand and detect how these errors arise in an end-to-end ASR framework, whose design is inspired by the functioning of the human brain.

     

    The DIETS project aims to push the current limits of our understanding of end-to-end ASR systems and to initiate new research combining a cross-disciplinary approach (computer science, linguistics, cognitive science, etc.) that puts humans back at the centre of the development of automatic systems.

     

     

    Required skills

     

    The position requires the following skills: a good command of French spelling and grammar, needed to categorise in an informed way the errors made by different transcription systems, and basic computing skills, since the data will have to be retrieved from a server. A background in linguistics or computational linguistics is desirable.

    Experience in designing, running and analysing behavioural tests is a plus.

     

    Host institution

     

    The host institution is the LIUM, the computer science laboratory of Le Mans Université, located in Le Mans. Regular presence at the laboratory is required throughout the post-doc. The LIUM is made up of two teams. The post-doc will take place in the LST team, whose research activities cover natural language processing for both text and speech. The team works with data-driven approaches and is also specialised in deep learning applied to language processing. It currently comprises one project manager, 11 faculty members (computer scientists, acousticians, linguists), 4 doctoral researchers and two Master's students on apprenticeship contracts.

  • 2024-01-04 15:19 | Anonymous


    Postdoctoral Scholar | Data Sciences and Artificial Intelligence at Penn State University

    The Data Sciences and Artificial Intelligence (DS/AI) group at Penn State invites applications for a Postdoctoral Scholar position, set to commence in Fall 2024. This role is centered on cutting-edge research at the nexus of machine learning, deep learning, computer vision, psychology, and biology, with foci on psychology-inspired AI and addressing significant biological questions using AI.

    To Apply: https://psu.wd1.myworkdayjobs.com/en-US/PSU_Academic/job/Postdoctoral-Scholar---College-of-IST-Data-Sciences-and-Artificial-Intelligence_REQ_0000050584-1

    Qualifications:

    • Ph.D. in computer science, A.I., data science, physics, or neuroscience with an emphasis on machine learning, or a closely related field. To qualify, candidates must possess a Ph.D. or terminal degree before their employment starts at Penn State.

    • A strong record of publications in high-impact journals or premier peer-reviewed international conferences.

    • Prior experience in conducting interdisciplinary/multidisciplinary research is a plus.

     

    About the position:

    The successful candidate will be designated as a Postdoctoral Scholar at the College of Information Sciences and Technology (IST) of The Pennsylvania State University. The initial term of the position is for one year, with the possibility of renewal depending on performance and funding availability. The scholar will be engaged in two interdisciplinary projects funded by the National Science Foundation, receiving mentorship from Professors James Wang (IST), Brad Wyble (Psychology), and Charles Anderson (Biology). The scholar will collaborate with highly motivated and talented graduate students and benefit from strong career development support, which includes training in teaching, grant proposal writing, and other collaborative work. Qualified candidates will have the ability to teach in IST after successfully completing one semester with approval from college leadership.

     

    To apply:

    • Please submit a CV, research statement (max 3 pages), and other pertinent documents in a single PDF document with the application.

    • Deadline: February 29, 2024, for full consideration. Late applications are accepted but given secondary priority.

    • Only shortlisted candidates will be contacted to provide reference letters.

    • For inquiries, please email Professor James Wang at jwang@ist.psu.edu with the subject line “postdoc”, or visit the lab website http://wang.ist.psu.edu.

     

    COMMITMENT TO DIVERSITY:

    The College of IST is strongly committed to a diverse community and to providing a welcoming and inclusive environment for faculty, staff and students of all races, genders, and backgrounds. The College of IST is committed to making good faith efforts to recruit, hire, retain, and promote qualified individuals from underrepresented minority groups including women, persons of color, diverse gender identities, individuals with disabilities, and veterans. We invite applicants to address their engagement in or commitment to inclusion, equity, and diversity issues as they relate to broadening participation in the disciplines represented in the college as well as aligning with the mission of the College of IST in a separate statement.

     

    CAMPUS SECURITY CRIME STATISTICS:

    Pursuant to the Jeanne Clery Disclosure of Campus Security Policy and Campus Crime Statistics Act and the Pennsylvania Act of 1988, Penn State publishes a combined Annual Security and Annual Fire Safety Report (ASR). The ASR includes crime statistics and institutional policies concerning campus security, such as those concerning alcohol and drug use, crime prevention, the reporting of crimes, sexual assault, and other matters. The ASR is available for review here.

     

    Employment with the University will require successful completion of background check(s) in accordance with University policies. 

     

    EEO IS THE LAW

    Penn State is an equal opportunity, affirmative action employer, and is committed to providing employment opportunities to all qualified applicants without regard to race, color, religion, age, sex, sexual orientation, gender identity, national origin, disability or protected veteran status. If you are unable to use our online application process due to an impairment or disability, please contact 814-865-1473.

  • 2024-01-04 15:18 | Anonymous

    Senior Data Scientist at the University of Chicago

     

    Please apply at https://uchicago.wd5.myworkdayjobs.com/External/job/Chicago-IL/Sr-Data-Scientist_JR24587

     

    About the Department
     

    The TMW Center for Early Learning + Public Health (TMW Center) develops science-based interventions, tools, and technologies to help parents and caregivers interact with young children in ways that maximize brain development. A rich language environment is critical to healthy brain development; however, few tools exist to measure the quality or quantity of these environments. Access to this type of data allows caregivers to enhance interactions in real time and gives policy-makers insight into how best to build policies that have a population-level impact.

    The wearable team within TMW Center is building a low-cost wearable device that can reliably and accurately measure a child’s early language environment vis-à-vis the conversational turns between a child and caregiver. The goal is to provide accurate, real-time feedback that empowers parents and caregivers to create the best language environment for their children.
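    As a purely illustrative sketch (not TMW Center code), one way to operationalise conversational turns is to count child/adult speaker alternations separated by less than a fixed gap, given diarized speech segments; the speaker labels, gap threshold, and segment format below are assumptions.

    from typing import List, Tuple

    Segment = Tuple[str, float, float]  # (speaker, start_s, end_s)

    def count_turns(segments: List[Segment], max_gap_s: float = 5.0) -> int:
        """Count adjacent segments where the speaker alternates within max_gap_s seconds."""
        segments = sorted(segments, key=lambda s: s[1])
        turns = 0
        for (spk_a, _, end_a), (spk_b, start_b, _) in zip(segments, segments[1:]):
            if spk_a != spk_b and (start_b - end_a) <= max_gap_s:
                turns += 1
        return turns

    # Example with hand-made segments; real labels would come from a diarization front-end.
    print(count_turns([("adult", 0.0, 2.1), ("child", 2.8, 3.5), ("adult", 4.0, 6.0)]))  # 2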


    Job Summary
     

    This position works independently to perform a variety of activities relating to software support and/or development. Analyzes, designs, develops, debugs, and modifies computer code for end user applications, beta general releases, and production support. Guides development and implementation of applications, web pages, and user interfaces using a variety of software applications, techniques, and tools. Solves complex problems in administration, maintenance, integration, and troubleshooting of the code and application ecosystem currently in production.

    We are searching for a strategic and inquisitive senior data scientist to develop and optimize innovative AI-based models focused on speech/audio processing. The senior data scientist is expected to outline requirements, brainstorm ideas and solutions with leadership, manage data integrity and conduct experiments, assign tasks to junior staff, and monitor performance of the team.

     

    Responsibilities

    • Formulates, suggests, and manages data-driven projects to support the development of audio algorithms and use cases.
    • Analyzes data from various entities for later use by junior data scientists.
    • Assesses scope and timelines, prioritizes goals, and prepares project plans to meet product and research objectives.
    • Delegates tasks to junior data scientists and provides coaching to improve quality of work.
    • Continuously trains and nurtures data scientists to take on bigger assignments.
    • Provides leadership in advancing the science of TMW Center interventions by generating new ideas and collaborating with the research analysis team.
    • In collaboration with CTO, selects and guides decisions on statistical procedures and model selections, including conducting exploratory experiments to develop proof of concept.
    • Cross-validates models to ensure generalization and predictability.
    • Stays informed about developments in Data Science and adjacent fields to ensure most relevant methods and outputs are being leveraged.
    • Ensures data governance is in place to comply with regulations and privacy standards and maintain documentation of methodologies, coding, and results.
    • Designs new systems, features, and tools. Solves complex problems and identifies opportunities for technical improvement and performance optimization. Reviews and tests code to ensure appropriate standards are met.
    • Utilizes technical knowledge of existing and emerging technologies, including public cloud offerings from Amazon Web Services, Microsoft Azure, and Google Cloud.
    • Acts as a technical consultant and resource for faculty research, teaching, and/or administrative projects.
    • Performs other related work as needed.


    Minimum Qualifications
     

    Education:

    Minimum requirements include a college or university degree in a related field.

    ---
    Work Experience:

    Minimum requirements include knowledge and skills developed through 5-7 years of work experience in a related job discipline.

    ---
    Certifications:

    ---

    Preferred Qualifications

    Education:

    • Master’s degree in Computer Science, Statistics, Mathematics, or Economics with a focus on computer science.

    Experience:

    • Experience with Machine Learning and LLMs.
    • Experience working on audio or speech data.
    • Experience implementing edge models using TensorFlow micro, TensorFlow lite, and corresponding quantization techniques.
    • Experience building audio classification models or speech to text models.
    • Experience using the latest pre-trained models such as Whisper and wav2vec.
    • Proven experience taking an idea or user need and translating it into fully realized applications.
    • Ability to relay insights in layman’s terms to inform business decisions. 
    • 3+ years leading and managing junior data scientists.

    Technical Skills or Knowledge:

    • Proficiency in Python, PyTorch, TensorFlow, TinyML, Pandas, and NumPy.
    • Experience with cloud environments such as AWS, Azure or GCloud.
    • Experience with command line interfaces (Linux, SSH).
    • Experience processing large datasets with Spark, Dask or Ray.

    Application Documents

    • Resume (required)
    • Cover Letter (preferred)


    When applying, the document(s) MUST be uploaded via the My Experience page, in the section titled Application Documents of the application.

  • 2024-01-04 15:17 | Anonymous

    Title: Predictive Modeling of Subjective Disagreement in Speech Annotation/Evaluation

    Host laboratory: LIUM

    Location: Le Mans

    Supervisors: Meysam Shamsi, Anthony Larcher

    Beginning of internship: February 2024

    Application deadline: 10/01/2024

    Keywords: Subjective Disagreement Modeling, Synthetic Speech Quality Evaluation, Speech Emotion Recognition

    In the context of modeling subjective tasks, where diverse opinions, perceptions, and judgments exist among individuals, such as speech quality evaluation or speech emotion recognition, addressing the challenge of defining ground truth and annotating a training set becomes crucial. The current practice of aggregating all annotations into a single label for modeling a subjective task is neither fair nor efficient [1]. The variability in annotations or evaluations can stem from various factors [2], broadly categorized into those associated with corpus quality and those intrinsic to the samples themselves. In the first case, the delicate definition of a subjective task introduces sensitivity into the annotation process, potentially leading to more errors, especially where the annotation tools and platform lack precision or annotators experience fatigue. In the second case, the inherent ambiguity in defining a subjective task and differing perceptions may result in varying annotations and disagreements.

    Developing a predictive model of annotator/evaluator disagreement is crucial for engaging in discussions about ambiguous samples and refining the definition of subjective concepts. Furthermore, such a model can serve as a valuable tool for assessing the confidence of automatic evaluations [3,4]. This modeling approach will contribute to the automatic evaluation of corpus annotations, the identification of ambiguous samples for reconsideration or re-annotation, the automatic assessment of subjective models, and the detection of underrepresented samples and biases in the dataset.

    The proposed research involves using a speech dataset such as MSP-Podcast [5], SOMOS [6], or VoiceMOS [7], for a subjective task with multiple annotations per sample. The primary objective is to predict the variation in assigned labels, measured through disagreement scores, entropy, or distribution.
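    As a minimal, purely illustrative sketch of one of the disagreement measures mentioned above (the entropy of the per-sample label distribution), assuming categorical labels from several annotators:

    import math
    from collections import Counter

    def label_entropy(annotations: list) -> float:
        """Shannon entropy (in bits) of the empirical label distribution for one sample."""
        counts = Counter(annotations)
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    # Example: five annotators label the emotion of one speech clip.
    print(label_entropy(["happy", "happy", "neutral", "happy", "sad"]))  # ~1.37 bits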

    Reference: [1]. Davani, A. M., Díaz, M., & Prabhakaran, V. (2022). Dealing with disagreements: Looking beyond the majority vote in subjective annotations. Transactions of the Association for Computational Linguistics, 10, 92-110.

    [2]. Kreiman, J., Gerratt, B. R., & Ito, M. (2007). When and why listeners disagree in voice quality assessment tasks. The Journal of the Acoustical Society of America, 122(4), 2354-2364.

    [3]. Wu, W., Chen, W., Zhang, C., & Woodland, P. C. (2023). It HAS to be Subjective: Human Annotator Simulation via Zero-shot Density Estimation. arXiv preprint arXiv:2310.00486.

    [4]. Han, J., Zhang, Z., Schmitt, M., Pantic, M., & Schuller, B. (2017, October). From hard to soft: Towards more human-like emotion recognition by modelling the perception uncertainty. In Proceedings of the 25th ACM international conference on Multimedia (pp. 890-897).

    [5]. Lotfian, R., & Busso, C. (2017). Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings. IEEE Transactions on Affective Computing, 10(4), 471-483.

    [6]. Maniati, G., Vioni, A., Ellinas, N., Nikitaras, K., Klapsas, K., Sung, J.S., Jho, G., Chalamandaris, A., & Tsiakoulis, P. (2022). SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis. Proc. Interspeech 2022, 2388-2392.

    [7]. Cooper, E., Huang, W. C., Tsao, Y., Wang, H. M., Toda, T., & Yamagishi, J. (2023). The VoiceMOS Challenge 2023: Zero-shot Subjective Speech Quality Prediction for Multiple Domains. arXiv preprint arXiv:2310.02640.

    Applicant profile: candidates motivated by artificial intelligence, enrolled in a Master's degree in Computer Science or a related field.

    To apply: send a CV and cover letter to meysam.shamsi@univ-lemans.fr or anthony.larcher@univ-lemans.fr before 10/01/2024.

  • 2024-01-04 15:16 | Anonymous

    ANR Project «REVITALISE»

    Automatic speech analysis of public talks. 

    Description. Today, important aspects of human interaction, such as information exchange, rely not only on so-called hard skills but also on soft skills. One such important skill is public speaking. Like many forms of interaction between people, the assessment of public speaking depends on many factors, often subjectively perceived. The goal of our project is to create an automatic system that can take these different factors into account and evaluate the quality of a performance. This requires understanding which elements can be assessed objectively and which vary depending on the listener [Hemamou, Wortwein, Chollet21]. Such an analysis must consider public speaking at several levels: high-level (audio, video, text), intermediate (voice monotony, auto-gestures, speech structure, etc.) and low-level (fundamental frequency, action units, POS tags, etc.) [Barkar].

    This internship offers an opportunity to analyze the audio component of a public speech. The student is asked to solve two main problems. The engineering task is to create an automatic speech transcription system that detects speech disfluencies; to do this, the student will collect a bibliography on the topic and propose an engineering solution (a possible starting point is sketched below). The research task is to use audio cues to automatically analyze how successful a talk was. This internship will give you the opportunity to solve an engineering problem as well as learn more about research approaches. By the end you will have expertise in audio processing as well as machine learning methods for multimodal analysis. If the internship is successfully completed, an article may be published. PhD position funding on Social Computing will be available in the team (at INRIA) at the end of the internship.
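    A rough sketch of one possible starting point for the engineering task, based on the openly available Whisper model cited in the references below; the filler-word list, the placeholder file name, and the word-level flagging are illustrative assumptions, not a complete disfluency detector.

    import whisper  # pip install openai-whisper

    FILLERS = {"euh", "uh", "um", "ben", "like"}  # assumed filler inventory

    model = whisper.load_model("base")
    # "talk.wav" is a placeholder; word_timestamps gives per-word start/end times.
    result = model.transcribe("talk.wav", word_timestamps=True)

    for segment in result["segments"]:
        for word in segment.get("words", []):
            token = word["word"].strip().lower().strip(".,!?")
            if token in FILLERS:
                print(f"possible disfluency '{token}' at {word['start']:.2f}s")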

    Registration & Organisation.

    Name of organization: Institut Polytechnique de Paris, Telecom-Paris
    Website of organization: https://www.telecom-paris.fr
    Department: IDS/LTCI
    Address: Palaiseau, France

    Supervision. Supervision will include weekly meetings with the main supervisor and regular meetings (every 2-3 weeks) with the co-supervisors.

    Name of supervisor: Alisa BARKAR
    Names of co-supervisors: Chloe Clavel, Mathieu Chollet, Béatrice BIANCARDI
    Contact details: alisa.barkar@telecom-paris.fr

    Duration & Planning. The internship is planned as a 5-6 month full-time internship for the spring semester of 2024. Six months corresponds to 24 weeks, during which the following activities will be covered:

    ● ACTIVITY 1(A1): Problem description and integration to the working environment

    ● ACTIVITY 2(A2): Bibliography overview

    ● ACTIVITY 3(A3): Implementation of the automatic transcription with detected discrepancies

    ● ACTIVITY 4(A4): Evaluation of the automatic transcription

    ● ACTIVITY 5(A5): Application of the developed methods to the existing data

    ● ACTIVITY 6(A6): Analysis of the importance of para-verbal features for the performance perception

    ● ACTIVITY 7(A7): Writing the report

    Selected references of the team.

    1. [Hemamou] L. Hemamou, G. Felhi, V. Vandenbussche, J.-C. Martin, C. Clavel, HireNet: a Hierarchical Attention Model for the Automatic Analysis of Asynchronous Video Job Interviews. in AAAI 2019, to appear

    2. [Ben-Youssef] Atef Ben-Youssef, Chloé Clavel, Slim Essid, Miriam Bilac, Marine Chamoux, and Angelica Lim. Ue-hri: a new dataset for the study of user engagement in spontaneous human-robot interactions. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, pages 464–472. ACM, 2017.

    3. [Wortwein] Torsten Wörtwein, Mathieu Chollet, Boris Schauerte, Louis-Philippe Morency, Rainer Stiefelhagen, and Stefan Scherer. 2015. Multimodal Public Speaking Performance Assessment. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction (ICMI '15). Association for Computing Machinery, New York, NY, USA, 43–50.

    4. [Chollet21] Chollet, M., Marsella, S., & Scherer, S. (2021). Training public speaking with virtual social interactions: effectiveness of real-time feedback and delayed feedback. Journal on Multimodal User Interfaces, 1-13.

    5. [Barkar] Alisa Barkar, Mathieu Chollet, Beatrice Biancardi, and Chloe Clavel. 2023. Insights Into the Importance of Linguistic Textual Features on the Persuasiveness of Public Speaking. In Companion Publication of the 25th International Conference on Multimodal Interaction (ICMI '23 Companion). Association for Computing Machinery, New York, NY, USA, 51–55. https://doi.org/10.1145/3610661.3617161

    Other references. 

    1. Dinkar, T., Vasilescu, I., Pelachaud, C. and Clavel, C., 2020, May. How confident are you? Exploring the role of fillers in the automatic prediction of a speaker’s confidence. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 8104-8108). IEEE.

    2. Whisper: Robust Speech Recognition via Large-Scale Weak Supervision, Radford A. et al., 2022, url: https://arxiv.org/abs/2212.04356

    3. Romana, Amrit and Kazuhito Koishida. “Toward A Multimodal Approach for Disfluency Detection and Categorization.” ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2023): 1-5.

    4. Radhakrishnan, Srijith et al. “Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition.” ArXiv abs/2310.06434 (2023): n. pag.

    5. Wu, Xiao-lan et al. “Explanations for Automatic Speech Recognition.” ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2023): 1-5.

    6. Min, Zeping and Jinbo Wang. “Exploring the Integration of Large Language Models into Automatic Speech Recognition Systems: An Empirical Study.” ArXiv abs/2307.06530 (2023): n. pag.

    7. Ouhnini, Ahmed et al. “Towards an Automatic Speech-to-Text Transcription System: Amazigh Language.” International Journal of Advanced Computer Science and Applications (2023): n. pag.

    8. Bigi, Brigitte. “SPPAS: a tool for the phonetic segmentations of Speech.” (2023).

    9. Rekesh, Dima et al. “Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition.” ArXiv abs/2305.05084 (2023): n. pag.

    10. Arisoy, Ebru et al. “Bidirectional recurrent neural network language models for automatic speech recognition.” 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2015): 5421-5425.

    11. Padmanabhan, Jayashree and Melvin Johnson. “Machine Learning in Automatic Speech Recognition: A Survey.” IETE Technical Review 32 (2015): 240 - 251.

    12. Berard, Alexandre et al. “End-to-End Automatic Speech Translation of Audiobooks.” 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2018): 6224-6228.

    13. Kheir, Yassine El et al. “Automatic Pronunciation Assessment - A Review.” ArXiv abs/2310.13974 (2023): n. pag.

  • 2024-01-04 15:15 | Anonymous

    Evaluation of speech synthesis systems in a noisy environment

    Topic. Perceptual evaluation is essential in many areas of speech technology, including speech synthesis. It assesses synthesis quality subjectively by asking a listening panel [5] to rate the quality of synthesized speech stimuli [1, 2]. Recent work has produced artificial intelligence models [3, 4] that predict the subjective rating of a synthesized speech segment, making it possible to dispense with a panel test. The major problem with this kind of evaluation is the interpretation of the word "quality". Some listeners may base their judgment on intrinsic characteristics of the speech (such as timbre, speaking rate, phrasing), while others may base it on characteristics of the audio signal (such as the presence or absence of distortion). The subjective evaluation of speech can therefore be biased by the listeners' interpretation of the instructions, and the artificial intelligence models mentioned above may in turn rest on biased measurements.

    The goal of this project is to carry out exploratory work to evaluate speech synthesis quality in a more robust way than has been proposed so far. We start from the hypothesis that the quality of synthesized speech can be estimated through its detectability in a real environment: a signal synthesized to perfectly reproduce human speech should not be detectable in an everyday-life environment. Based on this hypothesis, we propose to set up a speech perception experiment in noisy conditions. Sound-field reproduction methods make it possible to simulate an existing environment over headphones; their advantage is that a recording of a real environment can be played back over headphones while adding extra signals as if they had been present in the recorded sound scene. This involves, first, an acoustic measurement campaign in noisy everyday environments (public transport, open-plan offices, canteens, etc.). Synthesized speech will then be generated, taking the recording context into account; it will also be relevant to vary the parameters of the synthesized speech while keeping the semantics constant. The everyday recordings will then be mixed with the synthesized speech signals to evaluate how often the latter is detected. The percentage of times the synthesized speech is detected will serve as the quality indicator. These detection percentages will then be compared with the predictions of the artificial intelligence models mentioned above. We will thus be able to conclude (1) whether the two methods are equivalent or complementary and (2) which parameter(s) of the synthesized speech cause it to be detected in a noisy environment.
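    A minimal sketch, assuming mono signals at a common sampling rate and a background recording at least as long as the speech, of the two quantities described above: mixing a synthesized utterance into a real-world background recording at a chosen signal-to-noise ratio, and the detection percentage used as the quality indicator.

    import numpy as np

    def mix_at_snr(speech: np.ndarray, background: np.ndarray, snr_db: float) -> np.ndarray:
        """Scale the synthetic speech so it sits at snr_db above the background, then mix."""
        background = background[: len(speech)]  # assumes background is long enough
        p_speech = np.mean(speech ** 2)
        p_bg = np.mean(background ** 2)
        gain = np.sqrt(p_bg * 10 ** (snr_db / 10) / p_speech)
        return gain * speech + background

    def detection_rate(responses: list) -> float:
        """Fraction of trials in which listeners reported detecting the synthetic speech."""
        return sum(responses) / len(responses)

    # Example: 10 listener responses (True = detected) for one stimulus.
    print(detection_rate([True, True, False, True, False, True, True, False, True, True]))  # 0.7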

    Additional information:

    • Supervision: The internship will be co-supervised by Aghilas Sini, associate professor at the Laboratoire d'Informatique de l'Université du Mans (aghilas.sini@univ-lemans.fr), and Thibault Vicente, associate professor at the Laboratoire d'Acoustique de l'Université du Mans (thibault.vicente@univ-lemans.fr)

    • Required level: M2 research internship

    • Planned period: 6 months (February to July 2024)

    • Location: Le Mans Université

    • Keywords: synthesized speech, binaural sound synthesis, listening (jury) test

    References

    [1] Y.-Y. Chang. Evaluation of TTS systems in intelligibility and comprehension tasks. In Proceedings of the 23rd Conference on Computational Linguistics and Speech Processing (ROCLING 2011), pages 64–78, 2011.

    [2] J. Chevelu, D. Lolive, S. Le Maguer, and D. Guennec. Se concentrer sur les différences : une méthode d'évaluation subjective efficace pour la comparaison de systèmes de synthèse (Focus on differences: a subjective evaluation method to efficiently compare TTS systems). In Actes de la conférence conjointe JEP-TALN-RECITAL 2016, volume 1: JEP, pages 137–145, 2016.

    [3] C.-C. Lo, S.-W. Fu, W.-C. Huang, X. Wang, J. Yamagishi, Y. Tsao, and H.-M. Wang. MOSNet: Deep Learning-Based Objective Assessment for Voice Conversion. In Proc. Interspeech 2019, pages 1541–1545, 2019.

    [4] G. Mittag and S. Möller. Deep learning based assessment of synthetic speech naturalness. arXiv preprint arXiv:2104.11673, 2021.

    [5] M. Wester, C. Valentini-Botinhao, and G. E. Henter. Are we using enough listeners? No! An empirically-supported critique of Interspeech 2014 TTS evaluations. In 16th Annu. Conf. Int. Speech Commun. Assoc., 2015.

  • 2024-01-04 15:14 | Anonymous

    M2 Master Internship 

    Automatic Alsatian speech recognition 

    1 Supervisors

    Name: Emmanuel Vincent

    Team and lab: Multispeech team, Inria research center at Université de Lorraine, Nancy

    Email: emmanuel.vincent@inria.fr

    Name: Pascale Erhart

    Team and lab: Language/s and Society team, LiLPa, Strasbourg

    Email: pascale.erhart@unistra.fr

    2 Motivation and context

    This internship is part of the Inria COLaF project (Corpora and tools for the languages of France), whose objective is to develop and disseminate inclusive language corpora and technologies for regional languages (Alsatian, Breton, Corsican, Occitan, Picard, etc.), overseas languages and non-territorial immigration languages of France. With few exceptions, these languages are largely ignored by language technology providers [1]. However, such technologies are key to the protection, promotion and teaching of these languages. Alsatian is the second most widely spoken regional language in France in terms of number of speakers, with 46% of Alsace residents saying they speak it fairly well or very well [2]. It nevertheless remains an under-resourced language in terms of data and language technologies, although attempts at machine translation and data collection have been made [3].

    3 Objectives

    The objective of the internship is to design an automatic speech recognition system for Alsatian based on sound archives (radio, television, web, etc.). This raises two challenges: i) Alsatian is not a homogeneous language but a continuum of dialectal varieties which are not always written in a standardized way; ii) the textual transcription is often unavailable or differs from the pronounced words (transcription errors, subtitles, etc.). Solutions will rely on i) finding a suitable methodology for choosing and preparing data, and ii) designing an automatic speech recognition system using end-to-end neural networks, which may involve adapting an existing multilingual system such as Whisper [4] in a self-supervised manner from a number of untranscribed recordings [5], in a supervised manner from a smaller number of transcribed recordings, or even from text-only data [6]. The work will be based on datasets collected by LiLPa and the COLaF project’s engineers, which include the television shows Sunndi's Kater [7] and Kùmme Mit [8], whose dialogues are scripted, some radio broadcasts from the 1950s–1970s with their typescripts [9], as well as untranscribed radio broadcasts from France Bleu Elsass. Dictionaries of Alsatian, such as the Wörterbuch der elsässischen Mundarten available via the Woerterbuchnetz portal [10], or phonetization initiatives [11] could also be exploited, for example using the Orthal spelling convention [12]. The internship opens the possibility of pursuing a PhD thesis funded by the COLaF project.
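    As a purely illustrative starting point for adapting an existing multilingual system such as Whisper [4], the sketch below runs zero-shot transcription with Hugging Face Transformers; forcing German decoding is an assumption made here because Alsatian is not among Whisper's supported languages, and the file name is a placeholder. Supervised fine-tuning on transcribed Alsatian data, as described above, would build on this.

    import librosa
    from transformers import WhisperProcessor, WhisperForConditionalGeneration

    processor = WhisperProcessor.from_pretrained("openai/whisper-small")
    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

    audio, sr = librosa.load("alsatian_clip.wav", sr=16000)  # placeholder file name
    inputs = processor(audio, sampling_rate=sr, return_tensors="pt")

    # Force transcription (not translation); German is the assumed closest language.
    prompt_ids = processor.get_decoder_prompt_ids(language="german", task="transcribe")
    ids = model.generate(inputs.input_features, forced_decoder_ids=prompt_ids)
    print(processor.batch_decode(ids, skip_special_tokens=True)[0])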

    4 Bibliography

    [1] DGLFLF, Rapport au Parlement sur la langue française 2023, https://www.culture.gouv.fr/Media/Presse/Rapport-au-Parlement-surla-langue-francaise-2023

    [2] https://www.alsace.eu/media/5491/cea-rapport-esl-francais.pdf

    [3] D. Bernhard, A-L Ligozat, M. Bras, F. Martin, M. Vergez-Couret, P. Erhart, J. Sibille, A. Todirascu, P. Boula de Mareüil, D. Huck, “Collecting and annotating corpora for three under-resourced languages of France: Methodological issues”, Language Documentation & Conservation, 2021, 15, pp.316-357.

    [4] A. Radford, J.W. Kim, T. Xu, G. Brockman, C. McLeavey, I. Sutskever, “Robust speech recognition via large-scale weak supervision”, in 40th International Conference on Machine Learning, 2023, pp. 28492-28518.

    [5] A. Bhatia, S. Sinha, S. Dingliwal, K. Gopalakrishnan, S. Bodapati, K. Kirchhoff, “Don't stop self-supervision: Accent adaptation of speech representations via residual adapters”, in Interspeech, 2023, pp. 3362-3366.

    [6] N. San, M. Bartelds, B. Billings, E. de Falco, H. Feriza, J. Safri, W. Sahrozi, B. Foley, B. McDonnell, D. Jurafsky, “Leveraging supplementary text data to kick-start automatic speech recognition system development with limited transcriptions”, in 6th Workshop on Computational Methods for Endangered Languages, 2023, pp. 1-6.

    [7] https://www.france.tv/france-3/grand-est/sunndi-s-kater/

    [8] https://www.france.tv/france-3/grand-est/kumme-mit/toutes-les-videos/

    [9] https://www.ouvroir.fr/cpe/index.php?id=1511

    [10] https://woerterbuchnetz.de/?sigle=ElsWB#0

    [11] 10.5281/zenodo.1174213

    [12] https://orthal.fr/

    5 Profile

    MSc in speech processing, natural language processing, computational linguistics, or computer science. Strong programming skills in Python/PyTorch. Knowledge of Alsatian and/or German is a plus, but in no way a prerequisite.

  • 2024-01-04 15:13 | Anonymous

    -- Post-doctoral research position - L3i - La Rochelle France

    ---------------------------------------------------------------------------------------------------------------------------

    Title: Emotion detection by semantic analysis of the text in comics speech balloons

     

    The L3i laboratory has one open post-doc position in computer science, in the specific field of natural language processing in the context of digitised documents.

     

    Duration: 12 months (an extension of 12 months will be possible)

    Position available from: as soon as possible

    Salary: approximately 2100 € / month (net)

    Place: L3i lab, University of La Rochelle, France

    Specialty: Computer Science/ Document Analysis/ Natural Language Processing

    Contact: Jean-Christophe BURIE (jcburie [at] univ-lr.fr) / Antoine Doucet (antoine.doucet [at] univ-lr.fr)

     

    Position Description

    The L3i is a research lab of the University of La Rochelle. La Rochelle is a city in the south-west of France on the Atlantic coast and is one of the most attractive and dynamic cities in France. The L3i has been working for several years on document analysis and has developed well-known expertise in 'bande dessinée', manga and comics analysis, indexing and understanding.

    The work done by the post-doc will take place in the context of SAiL (Sequential Art Image Laboratory), a joint laboratory involving the L3i and a private company. The objective is to create innovative tools to index and interact with digitised comics. The work will be carried out in a team of 10 researchers and engineers.

    The team has developed different methods to extract and recognise the text of the speech balloons. The specific task of the recruited researcher will be to use natural language processing strategies to analyse this text in order to identify the emotions expressed by a character (in reaction to the utterance of another speaking character) or caused in the character being addressed. The datasets will be collections of comics in French and English.
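    A minimal sketch of the kind of analysis described above, using a generic Hugging Face text-classification pipeline; the model name is an assumption (any French or English emotion classifier could be substituted) and the balloon texts are invented examples, since real input would come from the team's text extraction methods.

    from transformers import pipeline

    # Assumed model choice; an English emotion classifier available on the Hugging Face Hub.
    emotion = pipeline("text-classification",
                       model="j-hartmann/emotion-english-distilroberta-base")

    balloons = [
        "Watch out! Behind you!",
        "I can't believe you came back for me...",
    ]
    for text in balloons:
        print(text, "->", emotion(text)[0]["label"])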

     

    Qualifications

    Candidates must have completed a PhD and have research experience in natural language processing. Some knowledge of and experience with deep learning is also recommended.

     

    General Qualifications

    • Good programming skills mastering at least one programming language like Python, Java, C/C++

    • Good teamwork skills

    • Good writing skills and proficiency in written and spoken English or French

     

    Applications

    Candidates should send a CV and a motivation letter to jcburie [at] univ-lr.fr and antoine.doucet [at] univ-lr.fr.
