ISCA - International Speech Communication Association
A seminar programme is an important part of the life of a research lab, especially for its research students, but it's difficult for scientists to travel to give talks at the moment. However, presentations may be given online and, paradoxically, it may thus be possible for labs to engage international speakers whom they wouldn't normally be able to afford.
ISCA has set up a pool of speakers prepared to give on-line talks. In this way we can enhance the experience of students working in our field, often in difficult conditions.
Speakers may pre-record their talks if they wish, but they don't have to. It is up to the host lab to contact speakers and make the arrangements. Talks can be state-of-the-art, or tutorials.
If you make use of this scheme and arrange a seminar, please send brief details (lab, speaker, date) to education@isca-speech.org
The scheme complements the ISCA Distinguished Lecturers programme.
If you wish to join the scheme as a speaker, all we need is a title, a short abstract, a one-paragraph biopic and contact details. Please send them to education@isca-speech.org
The speakers and their titles are given in this table. Further details follow.
Speaker | Contact | Talk title(s)
Roger Moore | | Talk #1: Talking with Robots: Are We Nearly There Yet? Talk #2: A needs-driven cognitive architecture for future 'intelligent' communicative agents
Martin Cooke | | The perception of distorted speech
Sakriani Sakti | ssakti@is.naist.jp | Semi-supervised Learning for Low-resource Multilingual and Multimodal Speech Processing with Machine Speech Chain
John Hansen | John.Hansen@utdallas.edu | Robust Diarization in Naturalistic Audio Streams: Recovering the Apollo Mission Control Audio
Thomas Hueber | | Articulatory-acoustic modeling for assistive speech technologies: a focus on silent speech interfaces and biofeedback systems
Karen Livescu | | Recognition of Fingerspelled Words in American Sign Language in the Wild
Odette Scharenborg | | Talk 1: Reaching over the gap: Cross- and interdisciplinary research on human and automatic speech processing. Talk 2: Speech representations and processing in deep neural networks.
Shrikanth (Shri) Narayanan | shri@ee.usc.edu | Talk 1: Sounds of the human vocal instrument. Talk 2: Computational Media Intelligence: Human-centered Machine Analysis of Media. Talk 3: Multimodal Behavioral Machine Intelligence for health applications
Ann Bradlow | abradlow@northwestern.edu | Second-language Speech Recognition by Humans and Machines
Shinji Watanabe | shinjiw@ieee.org | Tackling Multispeaker Conversation Processing based on Speaker Diarization and Multispeaker Speech Recognition
Giuseppe Riccardi | giuseppe.riccardi@unitn.it | Empathy in Human Spoken Conversations
Amalia Arvaniti | amalia.arvaniti@ru.nl | Forty years of the autosegmental-metrical theory of intonational phonology: an update and critical review in light of recent findings
Eric Fosler-Lussier | fosler@cse.ohio-state.edu | Low resourced but long tailed spoken dialogue system building
Heiga Zen | heigazen@google.com | Model-based text-to-speech synthesis
Ralf Schlueter | schlueter@cs.rwth-aachen.de | Automatic Speech Recognition in a State-of-Flux
Nancy Chen | nfychen@i2r.a-star.edu.sg | Talk 1: Controllable Neural Language Generation. Talk 2: Summarizing Conversations: From Meetings to Social Media Chats
Contact: r.k.moore [AT] sheffield.ac.uk
Web: http://staffwww.dcs.shef.ac.uk/people/R.K.Moore/
Prof. Moore has over 40 years' experience in Speech Technology R&D and, although an engineer by training, much of his research has been based on insights from human speech perception and production. As Head of the UK Government's Speech Research Unit from 1985 to 1999, he was responsible for the development of the Aurix range of speech technology products and the subsequent formation of 20/20 Speech Ltd. Since 2004 he has been Professor of Spoken Language Processing at the University of Sheffield, and also holds Visiting Chairs at Bristol Robotics Laboratory and University College London Psychology & Language Sciences. He was President of the European/International Speech Communication Association from 1997 to 2001, General Chair for INTERSPEECH-2009 and ISCA Distinguished Lecturer during 2014-15. In 2017 he organised the first international workshop on 'Vocal Interactivity in-and-between Humans, Animals and Robots (VIHAR)'. Prof. Moore is the current Editor-in-Chief of Computer Speech & Language. In 2016 he was awarded the LREC Antonio Zampolli Prize for "Outstanding Contributions to the Advancement of Language Resources & Language Technology Evaluation within Human Language Technologies", and in 2020 he received the International Speech Communication Association Special Service Medal for "service in the establishment, leadership and international growth of ISCA".
Talk #1: Talking with Robots: Are We Nearly There Yet?
Abstract: Recent years have seen considerable progress in the deployment of 'intelligent' communicative agents such as Apple's Siri and Amazon's Alexa. However, effective speech-based human-robot dialogue is less well developed; not only do the fields of robotics and spoken language technology present their own special problems, but their combination raises an additional set of issues. In particular, there appears to be a large gap between the formulaic behaviour that typifies contemporary spoken language dialogue systems and the rich and flexible nature of human-human conversation. As a consequence, we still seem to be some distance away from creating Autonomous Social Agents such as robots that are truly capable of conversing effectively with their human counterparts in real-world situations. This talk will address these issues and will argue that we need to go far beyond our current capabilities and understanding if we are to move from developing robots that simply talk and listen to evolving intelligent communicative machines that are capable of entering into effective cooperative relationships with human beings.
Talk #2: A needs-driven cognitive architecture for future ‘intelligent’ communicative agents
Abstract: Recent years have seen considerable progress in the deployment of ‘intelligent’ communicative agents such as Apple’s Siri, Google Now, Microsoft’s Cortana and Amazon’s Alexa. Such speech-enabled assistants are distinguished from the previous generation of voice-based systems in that they claim to offer access to services and information via conversational interaction. In reality, interaction has limited depth and, after initial enthusiasm, users revert to more traditional interface technologies. This talk argues that the standard architecture for a contemporary communicative agent fails to capture the fundamental properties of human spoken language. So an alternative needs-driven cognitive architecture is proposed which models speech-based interaction as an emergent property of coupled hierarchical feedback control processes. The implications for future spoken language systems are discussed.
Martin Cooke is Ikerbasque Research Professor in the Language and Speech Lab at the University of the Basque Country, Spain. After starting his career at the UK National Physical Laboratory, he worked at the University of Sheffield for 26 years before taking up his current position. His research has focused on computational auditory scene analysis, human speech perception and algorithms for robust automatic speech recognition. His interest in these domains also includes the effects of noise on speech production, as well as second language listening and acquisition models. He currently coordinates the EU Marie Curie Network ENRICH, which focuses on listening effort.
The perception of distorted speech
Listeners are capable of accommodating a staggering variety of speech forms, including those that bear little superficial resemblance to canonical speech. Speech can be understood on the basis of a mere pair of synthetic formants sent to different ears, or from three time-varying sinewaves, or from four bands of modulated noise. Surprisingly high levels of accuracy can be obtained from the summed output of two exceedingly narrow filters at the extremes of the frequency range of speech, or after interchanging the fine structure with a non-speech signal such as music. Any coherent theory of human speech perception must be able to account not just for the processing of canonical speech but also explain how an individual listener, perhaps aided by a period of perceptual learning, is capable of handling all of these disparate distorted forms of speech. In the first part of the talk I'll review a century's worth of distorted speech types, suggest some mechanisms listeners might be using to accomplish this feat, and provide anecdotal evidence for a human-machine performance gap for these speech types. In the second part I'll present results from two recent studies in my lab, one concerning a new form of distortion I call sculpted speech, the second looking at the fine time course of perceptual adaptation to eight types of distortion. I'll conclude with some pointers to what the next generation of machine listening might learn from human abilities to process these extremely-variable forms.
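To make the noise-vocoded condition mentioned above concrete, here is a minimal sketch of how such stimuli are commonly generated: the signal is split into a few frequency bands, the amplitude envelope of each band is extracted, and each envelope modulates band-limited noise. The band edges, filter order and file names are illustrative assumptions, not parameters from the studies described in the talk.

```python
# Minimal sketch of 4-band noise-vocoded speech (illustrative parameters only;
# assumes roughly 16 kHz audio so the top band edge stays below Nyquist).
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfilt, hilbert

def bandpass(x, lo, hi, fs):
    sos = butter(4, [lo, hi], btype="band", fs=fs, output="sos")
    return sosfilt(sos, x)

def noise_vocode(x, fs, edges=(100, 600, 1500, 3000, 6000)):
    out = np.zeros(len(x))
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = bandpass(x, lo, hi, fs)
        env = np.abs(hilbert(band))                      # band amplitude envelope
        carrier = bandpass(np.random.randn(len(x)), lo, hi, fs)
        out += env * carrier                             # envelope-modulated noise
    return out / (np.max(np.abs(out)) + 1e-9)

fs, speech = wavfile.read("utterance.wav")               # hypothetical mono file
vocoded = noise_vocode(speech.astype(float), fs)
wavfile.write("vocoded.wav", fs, (vocoded * 32767).astype(np.int16))
```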
Contact: ssakti [AT] is.naist.jp
Sakriani Sakti is currently a research associate professor at the Nara Institute of Science and Technology (NAIST) and a research scientist at the RIKEN Center for Advanced Intelligence Project (RIKEN AIP), Japan. She received the DAAD-Siemens Program Asia 21st Century Award in 2000 to study Communication Technology at the University of Ulm, Germany, and received her MSc degree in 2002. During her thesis work, she worked with the Speech Understanding Department of the DaimlerChrysler Research Center, Ulm, Germany. She then worked as a researcher at ATR Spoken Language Communication (SLC) Laboratories, Japan, in 2003-2009, and at the NICT SLC Groups, Japan, in 2006-2011, which established multilingual speech recognition for speech-to-speech translation. While working with ATR and NICT, she continued her studies (2005-2008) with the Dialog Systems Group at the University of Ulm, Germany, and received her Ph.D. degree in 2008. She was actively involved in international collaboration activities such as the Asia-Pacific Telecommunity Project (2003-2007) and various speech-to-speech translation research projects, including A-STAR and U-STAR (2006-2011). In 2011-2017, she was an assistant professor at the Augmented Human Communication Laboratory, NAIST, Japan. She also served as a visiting scientific researcher at INRIA Paris-Rocquencourt, France, in 2015-2016, under the JSPS Strategic Young Researcher Overseas Visits Program for Accelerating Brain Circulation. Since January 2018, she has served as a research associate professor at NAIST and a research scientist at RIKEN AIP, Japan. She is a member of JNS, SFN, ASJ, ISCA, IEICE, and IEEE. She is currently a committee member of the IEEE SLTC (2021-2023) and an associate editor of the IEEE/ACM Transactions on Audio, Speech, and Language Processing (2020-2023). She was a board member of Spoken Language Technologies for Under-resourced Languages (SLTU) and the general chair of SLTU 2016. She was also the general chair of the "Digital Revolution for Under-resourced Languages (DigRevURL)" workshop, held as an Interspeech special session in 2017, and of DigRevURL Asia in 2019, and was on the organizing committee of the Zero Resource Speech Challenge 2019 and 2020. She was involved in creating the joint ELRA and ISCA Special Interest Group on Under-resourced Languages (SIGUL) and has served on the SIGUL Board since 2018. In 2019, in collaboration with UNESCO and ELRA, she was also on the organizing committee of the international conference "Language Technologies for All (LT4All): Enabling Linguistic Diversity and Multilingualism Worldwide". Her research interests lie in deep learning and graphical model frameworks, statistical pattern recognition, zero-resourced speech technology, multilingual speech recognition and synthesis, spoken language translation, social-affective dialog systems, and cognitive communication.
Semi-supervised Learning for Low-resource Multilingual and Multimodal Speech Processing with Machine Speech Chain
The development of advanced spoken language technologies based on automatic speech recognition (ASR) and text-to-speech synthesis (TTS) has enabled computers to learn how to listen or how to speak. Many applications and services are now available, but they still support fewer than 100 languages; nearly 7000 living languages, spoken by 350 million people, remain uncovered. This is because such systems are commonly built with machine learning trained in a supervised fashion, which requires large amounts of speech paired with corresponding transcriptions.
In this talk, we will introduce a semi-supervised learning mechanism based on a machine speech chain framework. First, we describe the primary machine speech chain architecture that learns not only to listen or speak but also to listen while speaking. The framework enables ASR and TTS to teach each other given unpaired data. After that, we describe the use of the machine speech chain for code-switching and cross-lingual ASR and TTS in several languages, including low-resourced ethnic languages. Finally, we describe the recent multimodal machine chain that mimics overall human communication by listening while speaking and visualizing. With the support of image captioning and production models, the framework enables ASR and TTS to improve their performance using an image-only dataset.
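The "teach each other" loop can be sketched schematically as below. The model objects, their methods (loss, transcribe, synthesize) and the optimizer are hypothetical placeholders standing in for real ASR/TTS networks; this is only an outline of the training logic described in the abstract, not the authors' implementation.

```python
# Schematic machine speech chain training step (placeholder models).
def speech_chain_step(asr, tts, paired, unpaired_speech, unpaired_text, optimizer):
    optimizer.zero_grad()
    speech, text = paired

    # 1) Supervised losses on the small paired set (speech, transcript).
    loss = asr.loss(speech, text) + tts.loss(text, speech)

    # 2) Unpaired speech: ASR transcribes it, TTS tries to reconstruct the audio
    #    from that transcript; the reconstruction error trains TTS.
    hyp_text = asr.transcribe(unpaired_speech)
    loss = loss + tts.loss(hyp_text, unpaired_speech)

    # 3) Unpaired text: TTS synthesizes audio, ASR tries to recover the text;
    #    the recognition error trains ASR.
    synth_speech = tts.synthesize(unpaired_text)
    loss = loss + asr.loss(synth_speech, unpaired_text)

    loss.backward()
    optimizer.step()
    return float(loss)
```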
Contact: John.Hansen [AT] utdallas.edu
John H.L. Hansen received his Ph.D. and M.S. degrees from the Georgia Institute of Technology, and his B.S.E.E. degree from Rutgers University. He joined the University of Texas at Dallas (UTDallas) in 2005, where he is Associate Dean for Research, Professor of Electrical & Computer Engineering, Distinguished University Chair in Telecommunications Engineering, and holds a joint appointment in the School of Behavioral & Brain Sciences (Speech & Hearing). At UTDallas, he established the Center for Robust Speech Systems (CRSS). He is an ISCA Fellow, IEEE Fellow, past Member and TC-Chair of the IEEE Signal Processing Society Speech & Language Processing Technical Committee (SLTC), and Technical Advisor to the U.S. Delegate for NATO (IST/TG-01). He currently serves as ISCA President. He has supervised 92 PhD/MS thesis candidates, was recipient of the 2020 UT-Dallas Provost's Award for Graduate Research Mentoring and the 2005 University of Colorado Teacher Recognition Award, and is author/co-author of more than 750 journal and conference papers in the field of speech/language/hearing processing and technology. He served as General Chair for INTERSPEECH-2002, Co-Organizer and Technical Chair for IEEE ICASSP-2010, and Co-General Chair and Organizer for the IEEE Workshop on Spoken Language Technology (SLT-2014) (Lake Tahoe, NV). He is serving as Co-Chair for ISCA INTERSPEECH-2022 and Technical Chair for IEEE ICASSP-2024.
Robust Diarization in Naturalistic Audio Streams: Recovering the Apollo Mission Control Audio
Speech technology has advanced significantly beyond general speech recognition for voice command and telephony applications. Today, the emergence of big data, machine learning and voice-enabled speech systems has created the need for effective voice capture and automatic speech/speaker recognition. The ability to employ speech and language technology to assess human-to-human interactions is opening up new research paradigms which can have a profound impact on assessing human interaction. In this talk, we will focus on big-data audio processing relating to the APOLLO lunar missions. ML-based technology advancements include automatic audio diarization and speaker recognition for audio streams which involve multiple tracks, speakers, and environments. CRSS-UTDallas built a recovery solution for lost 30-track audio tapes from NASA Apollo-11, resulting in massive multi-track audio processing of 19,000 hours of data. Recent additional support from NSF will allow for the recovery and organization of an additional 150,000 hours of mission data to be shared with the communities of: (i) speech/language technology, (ii) STEM/science and team-based researchers, and (iii) education/historical/archiving specialists.
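As background for readers new to diarization, the core "who spoke when" step usually reduces to clustering speaker embeddings extracted from voice-activity segments. The sketch below shows that step only, with random vectors standing in for real embeddings; it is not the CRSS-UTDallas Apollo pipeline.

```python
# Minimal speaker-clustering step of a diarization pipeline (toy data).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def cluster_speakers(embeddings, n_speakers):
    """embeddings: (n_segments, dim) array, one speaker embedding per
    voice-activity segment. Returns a speaker label for each segment."""
    dists = pdist(embeddings, metric="cosine")     # pairwise cosine distances
    tree = linkage(dists, method="average")        # agglomerative clustering
    return fcluster(tree, t=n_speakers, criterion="maxclust")

segment_embeddings = np.random.randn(20, 128)      # stand-ins for x-vectors
print(cluster_speakers(segment_embeddings, n_speakers=3))
```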
Contact: thomas.hueber [AT] gipsa-lab.fr
Dr. Thomas Hueber has been a tenured CNRS researcher at GIPSA-lab (Grenoble, France) since 2011. He is head of the "Cognitive Robotics, Interactive Systems and Speech Processing" (CRISSP) team. He holds an engineering degree and an M.Sc. in Signal Processing from the University of Lyon (2006), a Ph.D. in Computer Science from Pierre and Marie Curie University, Paris (2009), and an HDR (accreditation to supervise research) from Grenoble Alpes University (2019). His research activities deal with multimodal speech processing, with a special interest in assistive technologies that exploit speech articulatory gestures and physiological activities. He has co-authored 17 articles in peer-reviewed international journals, more than 35 articles in peer-reviewed international conferences, 3 book chapters and one patent. He received the 6th Christian Benoit award (ISCA/AFCP) in 2011. In 2017, he co-edited the special issue on biosignal-based speech processing in IEEE/ACM Transactions on Audio, Speech and Language Processing.
Articulatory-acoustic modeling for assistive speech technologies: a focus on silent speech interfaces and biofeedback systems
Speech production is a complex motor process involving several physiological phenomena, such as neural, nervous and muscular activities that drive our respiratory, laryngeal and articulatory systems. Over the last 15 years, an increasing number of studies have proposed to rely on these activities to build devices that could restore oral communication when part of the speech production chain is damaged, or that could help rehabilitate speech sound disorders. In this talk, I will focus on two lines of research: 1) silent speech interfaces, which convert speech articulatory movements into text or synthetic speech, and 2) biofeedback systems, which provide visual information about the tongue for speech therapy and language learning. I will give an overview of the literature in these fields, which face common challenges and share methodological frameworks. I will present some of our recent contributions, with a focus on experimental techniques to capture multimodal speech-related signals, machine learning algorithms to model articulatory-acoustic relationships, and clinical evaluation of real-time prototypes.
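The articulatory-to-acoustic mapping at the heart of a silent speech interface can be illustrated with a simple frame-wise regression. The data below are synthetic stand-ins (random "sensor" frames mapped to invented spectral targets); real systems are trained on recorded articulatory/acoustic corpora and use far richer models than this sketch.

```python
# Toy articulatory-to-acoustic regression (synthetic data, illustrative only).
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_artic = rng.standard_normal((2000, 12))      # e.g. 6 sensor coils x (x, y)
W = rng.standard_normal((12, 40))
Y_acoustic = np.tanh(X_artic @ W)              # pretend 40-dim spectral frames

model = MLPRegressor(hidden_layer_sizes=(128, 128), max_iter=300, random_state=0)
model.fit(X_artic[:1500], Y_acoustic[:1500])
print("held-out R^2:", model.score(X_artic[1500:], Y_acoustic[1500:]))
```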
Contact: livescu [AT] ttic.edu
Karen Livescu is an Associate Professor at TTI-Chicago. She completed her PhD in electrical engineering and computer science at MIT. Her main research interests are in speech and language processing, as well as related problems in machine learning. Her recent work includes unsupervised and multi-view representation learning, acoustic word embeddings, visually grounded speech modeling, and automatic sign language recognition. She is a 2021 IEEE SPS Distinguished Lecturer. Other recent professional activities include serving as a program chair of ICLR 2019, a technical chair of ASRU 2015/2017/2019, and Associate Editor for IEEE T-PAMI and IEEE OJ-SP.
Toward Understanding Open-Domain Sign Language in Natural Environments
Research on sign language video processing has made exciting progress, including on tasks like sign-to-gloss transcription and sign-to-written language translation. In this talk I will describe our work on several fronts, aimed at handling a broader set of domains and visual conditions in sign language video. We have collected several datasets of American Sign Language videos drawn from online sign language media. One important capability needed for open-domain sign language processing is the handling of fingerspelling, a component of sign language in which a word is spelled out through a sequence of letter-specific signs. Fingerspelling is frequently used to sign names and other important words that have no lexical signs. I will describe some of our work on detection and recognition of fingerspelling sequences, using our recently collected Chicago Fingerspelling in the Wild (ChicagoFSWild, ChicagoFSWild+) datasets. I will also describe our ongoing collection and annotation of the OpenASL dataset for ASL-to-English translation, starting from online captioned ASL media, and our development of large- and open-vocabulary translation models using this dataset. Along the way, I will present several technical strategies that we have explored for handling the challenges of natural sign language video.
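One way to picture the fingerspelling recognition task is as frame-level letter prediction followed by CTC-style decoding: a video model emits per-frame letter posteriors and the decoder collapses repeats and blanks into a spelled word. The toy posteriors below are invented; this is a generic illustration, not the model described in the talk.

```python
# Greedy CTC decoding of per-frame letter posteriors (toy example).
import numpy as np

ALPHABET = ["<blank>"] + list("abcdefghijklmnopqrstuvwxyz")

def ctc_greedy_decode(frame_posteriors):
    """frame_posteriors: (T, 27) array. Take the argmax per frame,
    merge repeated symbols, then drop blanks."""
    best = frame_posteriors.argmax(axis=1)
    out, prev = [], -1
    for idx in best:
        if idx != prev and idx != 0:       # index 0 is the blank symbol
            out.append(ALPHABET[idx])
        prev = idx
    return "".join(out)

# Six frames whose argmaxes are c, c, a, a, t, t -> decodes to "cat".
post = np.full((6, len(ALPHABET)), 0.01)
for t, ch in enumerate("ccaatt"):
    post[t, ALPHABET.index(ch)] = 1.0
print(ctc_greedy_decode(post))
```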
Contact: O.E.Scharenborg@tudelft.nl
Multimedia Computing Group
Faculty of Electrical Engineering, Mathematics, and Computer Science
Delft University of Technology
The Netherlands
Twitter: Oscharenborg
Website: https://odettescharenborg.wordpress.com
Odette Scharenborg is an Associate Professor and Delft Technology Fellow at Delft University of Technology, the Netherlands. Her research interests focus on narrowing the gap between automatic and human spoken-word recognition. Particularly, she is interested in the question where the difference between human and machine recognition performance originates, and whether it is possible to narrow this performance gap. In her research she combines different research methodologies ranging from human listening experiments to computational modelling and deep learning. Odette co-organized the Interspeech 2008 Consonant Challenge, which aimed at promoting comparisons of human and machine speech recognition in noise. In 2017, she was elected onto the ISCA board, and in 2018 onto the IEEE Speech and Language Processing Technical Committee. She is an associate editor of IEEE Signal Processing Letters and a member of the European Laboratory for Learning and Intelligent Systems (ELLIS) unit Delft. She has served as area chair of Interspeech since 2015 and currently is on the Technical Programme Committee of Interspeech 2021 Brno.
Reaching over the gap: Cross- and interdisciplinary research on human and automatic speech processing.
The fields of human speech recognition (HSR) and automatic speech recognition (ASR) both investigate parts of the speech recognition process and have word recognition as their central issue. Although the research fields appear closely related, their aims and research methods are quite different. Despite these differences, the past two decades have seen a growing interest in possible cross-fertilisation: researchers from both ASR and HSR are realising the potential benefit of looking at the research field on the other side of the ‘gap’. In this survey talk, I will provide an overview of past and present efforts to link human and automatic speech recognition research, and present an overview of the literature describing the performance difference between machines and human listeners. The focus of the talk is on the mutual benefits to be derived from establishing closer collaborations and knowledge interchange between ASR and HSR.
Towards Inclusive automatic speech recognition.
Automatic speech recognition (ASR) is increasingly used, e.g. in emergency response centres, domestic voice assistants (e.g., Google Home), and search engines. ASR systems work (really) well for standard speakers of languages for which there is enough training data to train the systems. However, practice and recent evidence suggest that state-of-the-art ASRs struggle with the large variation in speech due to, e.g., gender, age, speech impairment, race, and accent; that is, ASR systems tend not to work well for "non-standard" speakers of a language.
In this talk, I will present our recent research on inclusive automatic speech recognition, i.e., ASR systems that work for everyone, irrespective of how someone speaks or the language that person speaks. The overarching goal in our new research line on inclusive automatic speech recognition is to uncover bias in ASR systems in order to work towards proactive bias reduction in ASR. In this talk, I will provide an overview of possible factors that can cause this bias; I will present systematic experiments aimed at quantifying and locating the bias of state-of-the-art ASRs on speech from different groups of speakers; and I will present recent research on reducing the bias against non-native accented Dutch. In our work, we focus on bias against gender, age, regional accents and non-native accents.
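The quantification step described above often comes down to computing word error rate (WER) separately per speaker group and inspecting the gaps. The sketch below does exactly that on two toy utterances; the groups, transcripts and the per-utterance averaging are illustrative, not the experimental protocol used in the studies.

```python
# Per-group word error rate as a simple bias probe (toy data).
def wer(ref, hyp):
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)

utterances = [  # (speaker group, reference transcript, ASR hypothesis)
    ("native",     "turn on the lights", "turn on the lights"),
    ("non-native", "turn on the lights", "turn of the light"),
]
by_group = {}
for group, ref, hyp in utterances:
    by_group.setdefault(group, []).append(wer(ref, hyp))
for group, scores in by_group.items():
    print(group, sum(scores) / len(scores))   # gap between groups indicates bias
```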
Contact: shri@ee.usc.edu
University of Southern California, Los Angeles, CA
Signal Analysis and Interpretation Laboratory
https://sail.usc.edu/people/shri.html
Shrikanth (Shri) Narayanan is University Professor and Niki & C. L. Max Nikias Chair in Engineering at the University of Southern California, where he is Professor of Electrical & Computer Engineering, Computer Science, Linguistics, Psychology, Neuroscience, Pediatrics, and Otolaryngology—Head & Neck Surgery, Director of the Ming Hsieh Institute and Research Director of the Information Sciences Institute. Prior to USC he was with AT&T Bell Labs and AT&T Research. His research focuses on human-centered information processing and communication technologies. He is a Fellow of the National Academy of Inventors, the Acoustical Society of America, IEEE, ISCA, the American Association for the Advancement of Science (AAAS), the Association for Psychological Science, and the American Institute for Medical and Biological Engineering (AIMBE). He is a recipient of several honors including the 2015 Engineers Council’s Distinguished Educator Award, a Mellon award for mentoring excellence, the 2005 and 2009 Best Journal Paper awards from the IEEE Signal Processing Society and serving as its Distinguished Lecturer for 2010-11, a 2018 ISCA CSL Best Journal Paper award, and serving as an ISCA Distinguished Lecturer for 2015-16, Willard R. Zemlin Memorial Lecturer for ASHA in 2017, and the Ten Year Technical Impact Award in 2014 and the Sustained Accomplishment Award in 2020 from ACM ICMI. He has published over 900 papers and has been granted seventeen U.S. patents. His research and inventions have led to technology commercialization including through startups he co-founded: Behavioral Signals Technologies focused on the telecommunication services and AI based conversational assistance industry and Lyssn focused on mental health care delivery, treatment and quality assurance.
Sounds of the human vocal instrument
The vocal tract is the universal human instrument played with great dexterity to produce the elegant acoustic structuring of speech, song and other sounds to communicate intent and emotions. The sounds produced by the vocal instrument also carry crucial information about individual identity and the state of health and wellbeing. A longstanding research challenge has been in improving the understanding of how vocal tract structure and function interact, and notably in illuminating the variant and invariant aspects of speech (and beyond) within and across individuals. The first part of the talk will highlight engineering advances that allow us to perform investigations on the human vocal tract in action-- from capturing the dynamics of vocal production using novel real-time magnetic resonance imaging to machine learning based articulatory-audio modeling--to offer insights about how we produce sounds with the vocal instrument. The second part of the talk will highlight some scientific, technological and clinical applications using such multimodal data driven approaches in the study of the human vocal instrument.
Computational Media Intelligence: Human-centered Machine Analysis of Media
Media is created by humans for humans to tell stories. There is a natural and imminent need that exists for creating human-centered media analytics to illuminate the stories being told and to understand their human impact. Objective rich media content analysis has numerous applications to different stakeholders: from creators and decision/policy makers to consumers. Advances in multimodal signal processing and machine learning enable detailed and nuanced characterization of media content: of what, who, how, and why, and help understand and predict impact, both individual (emotional) experiences and broader societal consequences.
Emerging advances have enabled us to measure the various multimodal facets of media and answer these questions on a global scale. Today, deep learning algorithms can analyze entertainment media (movies, TV) and quantify gender, age and race representations and measure how often women and underrepresented minorities appear in scenes or how often they speak to create awareness in objective ways not possible before. Text mining algorithms and natural language processing (NLP) can understand language use in movie scripts, and dialog interactions to track patterns of who is interacting with whom and how, and study trends in their adoption by different communities. Moreover, advances in human sensing allow for directly measuring the influence and impact of media on an individual’s physiology (and brain), while progress in social media measurements enable tracking the spread and social impact of media content on social communities.
This talk will focus on the opportunities and advances in human-centered media intelligence, drawing examples from media for entertainment (e.g., movies) and commerce (e.g., advertisements). It will highlight multimodal processing of audio, video and text streams, and other metadata associated with the content creation, to provide insights into media stories, including human-centered trends and patterns such as unconscious biases along dimensions such as gender, race and age, as well as associated social aspects (e.g., violence) and commercial aspects (e.g., box office returns) relatable to media content.
Multimodal Behavioral Machine Intelligence for health applications
The convergence of sensing, communication and computing technologies — most dramatically witnessed in the global proliferation of smartphones, and IoT deployments — offers tremendous opportunities for continuous acquisition, analysis and sharing of diverse, information-rich yet unobtrusive time series data that provide a multimodal, spatiotemporal characterization of an individual’s behavior and state, and of the environment within which they operate. This has in turn enabled hitherto unimagined possibilities for understanding and supporting various aspects of human functioning in realms ranging from health and well-being to job performance.
These include data that afford the analysis and interpretation of multimodal cues of verbal and non-verbal human behavior to facilitate human behavioral research and its translational applications in healthcare. These data not only carry crucial information about a person’s intent, identity and trait but also underlying attitudes, emotions and other mental state constructs. Automatically capturing these cues, although vastly challenging, offers the promise of not just efficient data processing but in creating tools for discovery that enable hitherto unimagined scientific insights, and means for supporting diagnostics and interventions.
Recent computational approaches that have leveraged judicious use of both data and knowledge have yielded significant advances in this regard, for example in deriving rich, context-aware information from multimodal signal sources including human speech, language, and videos of behavior. These are even complemented and integrated with data about human brain and body physiology. This talk will focus on some of the advances and challenges in gathering such data and creating algorithms for machine processing of such cues. It will highlight some of our ongoing efforts in Behavioral Signal Processing (BSP)—technology and algorithms for quantitatively and objectively understanding typical, atypical and distressed human behavior—with a specific focus on communicative, affective and social behavior. The talk will illustrate Behavioral Informatics applications of these techniques that contribute to quantifying higher-level, often subjectively described, human behavior in a domain-sensitive fashion. Examples will be drawn from mental health and well being realms such as Autism Spectrum Disorder, Couple therapy, Depression, Suicidality, and Addiction counseling.
abradlow@northwestern.edu
Ann Bradlow received her PhD in Linguistics from Cornell University in 1993. She completed postdoctoral fellowships in Psychology at Indiana University (1993-1996) and Hearing Science at Northwestern University (1996-1998). Since 1998, Bradlow has been a faculty member in the Linguistics Department at Northwestern University (USA) where she directs the Speech Communication Research Group (SCRG). The SCRG pursues an interdisciplinary research program in acoustic phonetics and speech perception with a focus on speech intelligibility under conditions of talker-, listener-, and situation-related variability. A central line of current work investigates causes and consequences of divergent patterns of first-language (L1) and second-language (L2) speech production and perception.
Second-language Speech Recognition by Humans and Machines
This presentation will consider the causes, characteristics, and consequences of second-language (L2) speech production through the lens of a talker-listener alignment model. Rather than focusing on L2 speech as deviant from the L1 target, this model views speech communication as a cooperative activity in which interlocutors adjust their speech production and perception in a bi-directional, dynamic manner. Three lines of support will be presented. First, principled accounts of salient acoustic-phonetic markers of L2 speech will be developed with reference to language-general challenges of L2 speech production and to language-specific L1-L2 structural interactions. Next, I will examine recognition of L2 speech by listeners from various language backgrounds, noting in particular that for L2 listeners, L2 speech can be equally (or sometimes, more) intelligible than L1 speech. Finally, I will examine perceptual adaptation to L2 speech by L1 listeners, highlighting studies that focused on interactive, dialogue-based test settings where we can observe the dynamics of talker adaptation to the listener and vice versa. Throughout this survey, I will refer to current methodological and technical developments in corpus-based phonetics and interactive testing paradigms that open new windows on the dynamics of speech communication across a language barrier.
Shinji Watanabe is an Associate Professor at Carnegie Mellon University, Pittsburgh, PA. He received his B.S., M.S., and Ph.D. (Dr. Eng.) degrees from Waseda University, Tokyo, Japan. He was a research scientist at NTT Communication Science Laboratories, Kyoto, Japan, from 2001 to 2011, a visiting scholar at the Georgia Institute of Technology, Atlanta, GA, in 2009, a Senior Principal Research Scientist at Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA, USA, from 2012 to 2017, and an associate research professor at Johns Hopkins University, Baltimore, MD, from 2017 to 2020. His research interests include automatic speech recognition, speech enhancement, spoken language understanding, and machine learning for speech and language processing. He has published more than 200 papers in peer-reviewed journals and conferences and received several awards, including the best paper award from IEEE ASRU in 2019. He served as an Associate Editor of the IEEE Transactions on Audio, Speech, and Language Processing. He was/has been a member of several technical committees, including the APSIPA Speech, Language, and Audio Technical Committee (SLA), the IEEE Signal Processing Society Speech and Language Technical Committee (SLTC), and the Machine Learning for Signal Processing Technical Committee (MLSP).
Tackling Multispeaker Conversation Processing based on Speaker Diarization and Multispeaker Speech Recognition
Recently, speech recognition and understanding studies have shifted their focus from single-speaker automatic speech recognition (ASR) in controlled scenarios to more challenging and realistic multispeaker conversation analysis based on ASR and speaker diarization. The CHiME speech separation and recognition challenge is one of the attempts to tackle these new paradigms. This talk first describes the introduction and challenge results of the latest CHiME-6 challenge, which focuses on recognizing multispeaker conversations in a dinner party scenario. The second part of the talk tackles this problem with emerging end-to-end neural architectures. We first introduce an end-to-end single-microphone multispeaker ASR technique based on recurrent neural networks and transformers to show the effectiveness of the proposed method. Second, we extend this approach to leverage the benefit of multi-microphone input and realize simultaneous speech separation and recognition within a single neural network trained only with the ASR objective. Finally, we also introduce our recent attempts at speaker diarization based on end-to-end neural architectures, including basic concepts, online extensions, and handling of unknown numbers of speakers.
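A key ingredient in much end-to-end multispeaker ASR is a permutation-invariant training (PIT) objective: with several output channels and several reference transcripts, the loss is taken over the best channel-to-reference assignment. The sketch below illustrates that idea with placeholder loss values; it is a generic illustration, not necessarily the exact objective used in the systems described in the talk.

```python
# Permutation-invariant loss over channel-to-reference assignments (toy values).
from itertools import permutations

def pit_loss(pairwise_loss):
    """pairwise_loss[i][j]: loss of output channel i scored against reference j.
    Returns the minimum total loss over all assignments."""
    n = len(pairwise_loss)
    return min(sum(pairwise_loss[i][perm[i]] for i in range(n))
               for perm in permutations(range(n)))

losses = [[0.2, 1.5],    # channel 0 matches reference speaker 0 well
          [1.3, 0.4]]    # channel 1 matches reference speaker 1 well
print(pit_loss(losses))  # 0.6, achieved by the identity assignment
```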
Biography
Prof. Giuseppe Riccardi is founder and director of the Signals and Interactive Systems Lab at University of Trento, Italy. Prof. Riccardi has co-authored more than 200 scientific papers. He holds more than 90 patents in the field of automatic speech recognition, understanding, machine translation, natural language processing and machine learning. His current research interests are natural language modeling and understanding, spoken/multimodal dialogue, affective computing, machine learning and social computing.
Prof. Riccardi has been on the scientific and organizing committee of EUROSPEECH, INTERSPEECH, ICASSP, ASRU, SLT, NAACL, EMNLP, ACL and EACL. He has been elected member of the IEEE SPS Speech Technical Committee (2005-2008). He is a member of ACL, ACM and elected Fellow of IEEE (2010) and of ISCA (2017).
Empathy in Human Spoken Conversations
Empathy will be a critical ability of next-generation conversational agents. Empathy, as defined in behavioral sciences such as psychology, expresses the ability of human beings to recognize, understand and react to sensations, emotions, attitudes and beliefs of others. However, most computational speech and language research is limited to the emotion recognition ability only. We aim at reviewing the behavioral constructs of empathy, the acoustic and linguistic manifestations and its interaction with basic emotions. In psychology, there is no operational definition of empathy, which makes it vague and difficult to measure. In this talk, we review and evaluate a recently proposed categorical annotation protocol for empathy. This annotation protocol has been applied to a large corpus of real-life, dyadic natural spoken conversations. We will review the behavioral signal analysis of patterns of emotions and empathy.
Contact details: amalia.arvaniti@ru.nl
Biopic: Amalia Arvaniti holds the Chair of English Language and Linguistics at Radboud University, Netherlands. She had previously held research and teaching appointments at the University of Kent (2012-2020), UC San Diego (2001-2012), the University of Cyprus (1995-2001), as well as Cambridge, Oxford, Edinburgh. Her research, which focuses on the cross-linguistic study of prosody, has been widely published and cited, and has led to paradigm-shifts in our understanding of speech rhythm and intonation. Her current research on prosody focuses on intonation and is supported by an ERC-funded grant (ERC-ADG-835263; 2019-2024) titled Speech Prosody in Interaction: The form and function of intonation in human communication (SPRINT). The aim of SPRINT is to better understand the nature of intonational representations and the role of pragmatics and phonetic variability in shaping them in order to develop a phonological model of intonation that takes into consideration phonetic realization on the one hand, and intonation pragmatics on the other.
Forty years of the autosegmental-metrical theory of intonational phonology: an update and critical review in light of recent findings
It has been 40 years since Pierrehumbert’s seminal dissertation on “The phonology and phonetics of English intonation” which marked the beginning of the autosegmental-metrical theory of intonational phonology (henceforth AM). The success of AM has led to an explosion of research on intonation, but also brought a number of problems, such as the frequent conflation of phonetics and phonology and a return to long-questioned views on intonation. In this talk, I will first review the fundamental tenets of AM and address some common misconceptions that often lead to faulty comparisons with other models and questionable practices in intonation research more generally. I will also critically appraise the success of AM and review results emerging in the past decade, including results from my own recent research on English and Greek. These results suggest that some assumptions and research practices in AM and intonation research in general need to be reconsidered if we are to gain insight into the structure and functions of intonation crosslinguistically.
Bio: Eric Fosler-Lussier is a Professor of Computer Science and Engineering, with courtesy appointments in Linguistics and Biomedical Informatics, at The Ohio State University. He is also co-Program Director for the Foundations of Artificial Intelligence Community of Practice at OSU's Translational Data Analytics Institute. After receiving a B.A.S. (Computer and Cognitive Science) and B.A. (Linguistics) from the University of Pennsylvania in 1993, he received his Ph.D. in 1999 from the University of California, Berkeley. He has also been a Member of Technical Staff at Bell Labs, Lucent Technologies, and has held visiting positions at Columbia University and the University of Pennsylvania. He currently serves as the IEEE Speech and Language Technical Committee Chair and was co-General Chair of ASRU 2019 in Singapore. Eric's research has ranged over topics in speech recognition, dialog systems, and clinical natural language processing, and has been recognized with best paper awards from the IEEE Signal Processing Society and the International Medical Informatics Association.
Low resourced but long tailed spoken dialogue system building
In this talk, I discuss lessons learned from our partnership with the Ohio State School of Medicine in developing a Virtual Patient dialog system to train medical students in taking patient histories. The OSU Virtual Patient's unusual development history as a question-answering system provides some interesting insights into co-development strategies for dialog systems. I also highlight our work in "speechifying" the patient chatbot and handling semantically subtle questions when speech data is non-existent and language exemplars for questions are few.
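The question-answering framing mentioned above can be pictured as matching a (possibly paraphrased) student question to the closest canonical question and returning its scripted answer. The sketch below uses a generic TF-IDF retrieval baseline with invented questions and answers; it is not the OSU Virtual Patient system.

```python
# Toy canonical-question matcher for a virtual-patient style QA dialog system.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

canonical = {   # invented canonical questions and scripted patient answers
    "do you have any allergies": "I'm allergic to penicillin.",
    "when did the pain start": "It started about two days ago.",
    "are you taking any medication": "Just ibuprofen for the pain.",
}
questions = list(canonical)
vectorizer = TfidfVectorizer().fit(questions)
question_matrix = vectorizer.transform(questions)

def answer(user_question):
    sims = cosine_similarity(vectorizer.transform([user_question]), question_matrix)[0]
    return canonical[questions[sims.argmax()]]

print(answer("when exactly did this pain start?"))  # -> "It started about two days ago."
```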
Bio: Heiga Zen received his PhD from the Nagoya Institute of Technology, Nagoya, Japan, in 2006. He was an Intern/Co-Op researcher at the IBM T.J. Watson Research Center, Yorktown Heights, NY (2004--2005), and a Research Engineer at Toshiba Research Europe Ltd. Cambridge Research Laboratory, Cambridge, UK (2008--2011). At Google, he was in the Speech team from July 2011 to July 2018, then joined the Brain team from August 2018. His research interests include speech technology and machine learning.
Title: Model-based text-to-speech synthesis
Ralf Schlüter serves as Academic Director and Lecturer (Privatdozent) in the Department of Computer Science of the Faculty of Computer Science, Mathematics and Natural Sciences at RWTH Aachen University. He leads the Automatic Speech Recognition Group at the Lehrstuhl Informatik 6: Human Language Technology and Pattern Recognition. He studied physics at RWTH Aachen University and the University of Edinburgh, and received his Diploma in Physics (1995), his doctorate in Computer Science (2000) and his Habilitation in Computer Science (2019), all from RWTH Aachen University. Dr. Schlüter works on all aspects of automatic speech recognition and has led the scientific work of the Lehrstuhl Informatik 6 in the area of automatic speech recognition in many large national and international research projects, e.g. EU-Bridge and TC-STAR (EU), Babel (US-IARPA) and Quaero (French OSEO).
Automatic Speech Recognition in a State-of-Flux
Initiated by the successful utilization of deep neural network modeling for large vocabulary automatic speech recognition (ASR), the last decade has brought a considerable diversification of ASR architectures. Following the classical state-of-the-art hidden Markov model (HMM) based architecture, connectionist temporal classification (CTC), attention-based encoder-decoder models, the recurrent neural network transducer (RNN-T) and its monotonic variants, as well as segmental approaches including direct HMM architectures, were introduced. All these architectures show competitive performance, and the question arises as to which of them will finally prevail and define the new state of the art in large vocabulary ASR. In this presentation, a comparative review of current architectures in the context of the Bayes decision rule is provided. Relations and equivalences between architectures are derived, the utilization of data is considered, and the role of language modeling within integrated end-to-end architectures is discussed.
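For readers who want the reference point spelled out, the Bayes decision rule the abstract refers to is, in its standard ASR form (textbook notation, not specific to this talk):

```latex
% Choose the word sequence W with maximal posterior probability given the
% acoustic observation sequence X:
\hat{W} = \operatorname*{argmax}_{W} \, p(W \mid X)
        = \operatorname*{argmax}_{W} \, p(X \mid W)\, p(W)
% Here p(X|W) is the acoustic model and p(W) the language model; end-to-end
% architectures such as CTC, attention encoder-decoder and RNN-T model
% p(W|X) directly.
```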
Bio: Nancy F. Chen (SM) is a senior scientist, principal investigator, and group leader at I2R (Institute for Infocomm Research), A*STAR (Agency for Science, Technology, and Research), Singapore. Dr. Chen’s research focuses on conversational artificial intelligence (AI) and natural language generation with applications in education, healthcare, journalism, and defense. Speech evaluation technology developed by her team is deployed at the Ministry of Education in Singapore to support home-based learning to tackle challenges that arose during the COVID-19 pandemic. Dr. Chen also led a cross-continent team for low-resource spoken language processing, which was one of the top performers in the NIST (National Institute of Standards and Technology) Open Keyword Search Evaluations (2013-2016), funded by the IARPA (Intelligence Advanced Research Projects Activity) Babel program. Prior to I2R, A*STAR, Dr. Chen worked at MIT Lincoln Laboratory on multilingual speech processing and received her Ph.D. from MIT and Harvard in 2011.
Dr. Chen has received numerous awards, including IEEE SPS Distinguished Lecturer (2023), Singapore 100 Women in Tech (2021), Young Scientist Best Paper Award at 2021 MICCAI (Medical Image Computing and Computer Assisted Interventions), Best Paper Award at 2021 SIGDIAL (Special Interest Group on Discourse and Dialogue), the 2020 Procter & Gamble (P&G) Connect + Develop Open Innovation Award, the 11th L’Oréal UNESCO (United Nations Educational, Scientific and Cultural Organization) For Women in Science National Fellowship (2019), 2016 Best Paper Award at APSIPA (Asia-Pacific Signal and Information Processing Association), 2012 Outstanding Mentor Award from the Ministry of Education in Singapore, the Microsoft-sponsored IEEE Spoken Language Processing Grant (2011), and the NIH (National Institute of Health) Ruth L. Kirschstein National Research Award (2004-2008).
Dr. Chen has been very active in the international research community. She is program chair of ICLR (International Conference on Learning Representations) 2023, and has been ISCA (International Speech Communication Association) Board Member (2021-2025) and an elected member of the IEEE Speech and Language Technical Committee (2016-2018, 2019-2021). Dr. Chen has served many editorial positions, including senior area editor of Signal Processing Letters (2021-2022), associate editors of IEEE/ACM Transactions on Audio, Speech, and Language Processing (2020-2023), Neurocomputing (2020-2021), Computer Speech and Language (2021- present) and IEEE Signal Processing Letters (2019-2021) and guest editor for the special issue of “End-to-End Speech and Language Processing” in the IEEE Journal of Selected Topics in Signal Processing (2017).
In addition to her academic endeavors, technology from her team has resulted in spin-off companies such as nomopai to help engage customers with confidence and empathy. Dr. Chen has also consulted for various companies ranging from startups to multinational corporations in the areas of climate change (social impact startup normal), emotional intelligence (Cogito Health), education technology (Novo Learning), speech recognition (Vlingo, acquired by Nuance), and defense and aerospace (BAE Systems).
Title: Controllable Neural Language Generation
Abstract: Natural language generation fuels numerous applications, including machine translation, summarization, and conversational agents. Advancements in end-to-end neural modeling have injected many new research endeavors into language generation. However, while neural models are extremely capable of generating fluent and grammatically correct text, many challenges remain in ensuring factual correctness and minimizing hallucination, so that model outputs can be more readily usable in deployment. In this talk, we will share some case studies in summarization and dialogue technology to illustrate how we can better harness the potential of neural language generation.
Title: Summarizing Conversations: From Meetings to Social Media Chats
Abstract: Human-to-human conversation is a dynamic process of information exchange between multiple parties that unfolds and evolves over time. It remains one of the most natural means of how humans pass down knowledge, increase mutual understanding, convey emotions, and collaborate with one another. However, it is nontrivial to automatically distill such unstructured information into summaries useful for downstream tasks. In this talk we will examine recent approaches and applications in addition to discussing future trends.