Rapid Enhancement of NLP Systems by Acquisition of Data in Correlated Domains

Tejas Udayakumar, Kinnera Saranu, Mayuresh Sanjay Oak, Ajit Ashok Saunshikar, Sandip Shriram Bapat


Industries are going through a paradigm shift driven by the rapid growth of deep learning, and structured data plays a crucial role in automating various tasks. Structured textual data is one such kind, used extensively in systems such as chatbots and automatic speech recognition. Unfortunately, most of the textual data available is unstructured, in the form of user reviews and feedback, social media posts, etc. Automating the task of categorizing or clustering this data into meaningful domains will reduce the time and effort needed to build sophisticated human-interactive systems. In this paper, we present a web tool that builds domain-specific data from a search phrase, drawing on a database of highly unstructured user utterances. We also show the use of an Elasticsearch database with custom indexes for full correlated text search. The tool uses the open-source GloVe model combined with cosine similarity and performs a graph-based search to provide semantically and syntactically meaningful corpora. Finally, we discuss its applications with respect to natural language processing.
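The abstract does not give implementation details, so the following is only a minimal sketch of how a search phrase might be expanded into a domain corpus using word vectors, cosine similarity, and a graph search. Everything in it is an assumption made for illustration: the toy three-dimensional vectors stand in for pretrained GloVe embeddings, averaging word vectors is one simple way to embed an utterance, the 0.9 threshold is arbitrary, and the function names (`embed`, `cosine`, `build_graph`, `expand`) are hypothetical. The Elasticsearch retrieval step is omitted.

```python
import math
from collections import deque

# Toy 3-dimensional vectors standing in for pretrained GloVe embeddings.
# The values are invented for illustration only.
TOY_VECTORS = {
    "play": [0.9, 0.1, 0.0],
    "music": [0.8, 0.2, 0.1],
    "song": [0.85, 0.15, 0.05],
    "weather": [0.0, 0.9, 0.2],
    "rain": [0.1, 0.95, 0.1],
    "today": [0.1, 0.6, 0.6],
}


def embed(utterance):
    """Average the vectors of known words; return None if no word is known."""
    vecs = [TOY_VECTORS[w] for w in utterance.lower().split() if w in TOY_VECTORS]
    if not vecs:
        return None
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_graph(utterances, threshold):
    """Link every pair of utterances whose similarity meets the threshold."""
    embedded = {u: embed(u) for u in utterances}
    embedded = {u: v for u, v in embedded.items() if v is not None}
    graph = {u: [] for u in embedded}
    items = list(embedded.items())
    for i, (u, uv) in enumerate(items):
        for w, wv in items[i + 1:]:
            if cosine(uv, wv) >= threshold:
                graph[u].append(w)
                graph[w].append(u)
    return graph, embedded


def expand(seed_phrase, utterances, threshold=0.9):
    """Breadth-first search outward from utterances similar to the seed phrase."""
    seed_vec = embed(seed_phrase)
    if seed_vec is None:
        return []
    graph, embedded = build_graph(utterances, threshold)
    queue = deque(u for u, v in embedded.items() if cosine(seed_vec, v) >= threshold)
    seen = set(queue)
    result = []
    while queue:
        current = queue.popleft()
        result.append(current)
        for neighbour in graph[current]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return result
```

Calling `expand("music", ["play music", "play song", "rain today", "weather today"])` on this toy data collects the two music-related utterances and leaves the weather ones out. The graph step matters when an utterance is not close enough to the seed phrase itself but is close to an utterance that is; breadth-first traversal picks such neighbours up.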


Cite as: Udayakumar, T., Saranu, K., Oak, M.S., Saunshikar, A.A., Bapat, S.S. (2020) Rapid Enhancement of NLP Systems by Acquisition of Data in Correlated Domains. Proc. Interspeech 2020, 1017-1018.


@inproceedings{Udayakumar2020,
  author={Tejas Udayakumar and Kinnera Saranu and Mayuresh Sanjay Oak and Ajit Ashok Saunshikar and Sandip Shriram Bapat},
  title={{Rapid Enhancement of NLP Systems by Acquisition of Data in Correlated Domains}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={1017--1018}
}