I work as an Applied Scientist-2 at Amazon with the International Machine Learning team, where I build and improve multiple NLU components for Amazon’s Conversational Shopping Assistant.

Previously, I worked as a Machine Learning Scientist at Jio Haptik on fundamental Conversational-AI problems. I built the Intent Detection System for Haptik’s NLU Engine, which was 25% more accurate than the previous system, and owned it from research to production.

I have authored research papers accepted at top-tier venues such as ACL (Findings), the EMNLP NLP-OSS workshop, the EMNLP Insights workshop, the EACL LT-EDI workshop and FIRE.

I am also the creator of the open-source iNLTK library, which provides out-of-the-box support for various NLP tasks in 13 low-resource Indic languages. The library has 100,000+ downloads, and 700+ stars and 100+ forks on GitHub.

Prior to Jio Haptik, I worked at Goldman Sachs with the User Experience and Productivity team on Analytics for Desktop Assistant, a productivity tool used firm-wide.

I have a Master’s in Computer Science with a specialization in ML from Georgia Tech and a Bachelor’s in Computer Science from PEC University of Technology.

I am interested in applying Machine Learning to problems that impact millions of people, and I keep making small open-source contributions toward that goal.

Selected Publications

Google Scholar

Accepted at ACL 2023 (Findings)
CoMix: Guide transformers to code-mix using POS structure and phonetics
Gaurav Arora, Srujana Merugu, Vivek Sembium
[Paper]

Accepted at EMNLP-2020 NLP-OSS workshop
iNLTK: Natural Language Toolkit for Indic Languages
Gaurav Arora
[Paper] [GitHub]

Accepted at EMNLP-2020 Insights workshop
HINT3: Raising the bar for Intent Detection in the Wild
Gaurav Arora, Chirag Jain, Manas Chaturvedi, Krupal Modi
[Paper] [GitHub]

Accepted at Dravidian Codemix HASOC @ FIRE-2020
Pre-training ULMFiT on Synthetically Generated Code-Mixed Data for Hate Speech Detection
Gaurav Arora
[Paper] [GitHub]

Accepted at LT-EDI @ EACL-2021
Spartans@LT-EDI-EACL2021: Inclusive Speech Detection using Pretrained Language Models
Gaurav Arora*, Megha Sharma*
[Paper] [GitHub]

Education

2020 - 2022 Master’s in Computer Science (specialization in ML)
Georgia Institute of Technology (Georgia Tech)
2014 - 2018 B.Tech in Computer Science
PEC University of Technology
2012 - 2014 GMSSS-16, Chandigarh

Industry Experience

Apr 2021 - Present Amazon, Applied Scientist
July 2019 - Apr 2021 Jio Haptik, Machine Learning Scientist
June 2018 - July 2019 Goldman Sachs, Technology Analyst
May 2017 - Oct 2017 Goldman Sachs, Technology Analyst Intern
Nov 2016 - Mar 2018 Researchshala, Co-Founder and CTO

Open Source Contributions

Natural Language Toolkit for Indic Languages (iNLTK)

• iNLTK aims to provide out-of-the-box support for the various NLP tasks that an application developer might need for Indic languages
• iNLTK provides Data Augmentation, Sentence Similarity, Sentence Encoding, Word Embedding, Tokenization and Text Generation utilities for 13 low-resource Indic languages
• The library is backed by ULMFiT language models that I trained using the fastai and PyTorch libraries, producing SOTA LM perplexity and classification accuracy in 13 Indic languages

Appreciation for iNLTK
• By Jeremy Howard and Sebastian Ruder on Twitter
• Shared widely by the community on LinkedIn
• iNLTK has 100,000+ downloads on PyPI
• A Data Augmentation post about iNLTK was trending on LinkedIn
• iNLTK was trending on GitHub in May 2019
• Shared on Reddit, Facebook, Quora, etc. by the community

Code with AI

A tool that predicts which techniques to use to correctly solve a competitive programming problem
Demo video on YouTube

Appreciation for Code with AI
• By Jeremy Howard on Twitter
• By community on Codeforces
• The tool has been used by 3000+ users

NLP for Hindi

• Contains SOTA Language models and Classifier for Hindi
• Pretrained Models available for download: TransformerXL, ULMFiT

[Code] [Results] [Dataset] [Embeddings projection]

NLP for Sanskrit

• Contains SOTA Language models and Classifier for Sanskrit
• Pretrained Models available for download: TransformerXL, ULMFiT

[Code] [Results] [Dataset] [Embeddings projection]

NLP for Nepali

• Contains SOTA Language models and Classifier for Nepali
• Pretrained Models available for download: TransformerXL, ULMFiT

[Code] [Results] [Dataset] [Embeddings projection]

NLP for Tamil

• Contains SOTA Language models and Classifier for Tamil
• Pretrained Models available for download: TransformerXL, ULMFiT

[Code] [Results] [Dataset] [Embeddings projection]

NLP for Bengali

• Contains SOTA Language models and Classifier for Bengali
• Pretrained Models available for download: TransformerXL, ULMFiT

[Code] [Results] [Dataset] [Embeddings projection]

NLP for Punjabi

• Contains SOTA Language models and Classifier for Punjabi
• Pretrained Models available for download: TransformerXL, ULMFiT

[Code] [Results] [Dataset] [Embeddings projection]

NLP for Malayalam

• Contains SOTA Language models and Classifier for Malayalam
• Pretrained Models available for download: TransformerXL, ULMFiT

[Code] [Results] [Dataset] [Embeddings projection]

NLP for Odia

• Contains SOTA Language models and Classifier for Odia
• Pretrained Models available for download: TransformerXL, ULMFiT

[Code] [Results] [Dataset] [Embeddings projection]

NLP for Gujarati

• Contains SOTA Language models and Classifier for Gujarati
• Pretrained Models available for download: TransformerXL, ULMFiT

[Code] [Results] [Dataset] [Embeddings projection]

Honors & Awards

Mar 2021 Indian Achievers Award 2020 from Indian Achiever’s Forum (IAF) in the Young Achievers Category for contribution to nation building through iNLTK
Mar 2019 Fast.ai International Fellow for contributions to Fast.ai forums
Dec 2018 Top-17% rank in Human Protein Atlas Image Classification on Kaggle for developing a Deep Learning model that classified mixed patterns of proteins in microscope images. The competition had 2172 teams; I participated individually, so the 366th-placed solution was entirely my own work
Oct 2017 1st Prize in IEEE-Hackathon for developing a chatbot to help people with emotional decisions in life
Feb 2016 Top-100 among 500,000 students in IT-Olympiad, 2016
Oct 2016 2nd Prize in IEEE-Hackathon for developing an Augmented Reality application to help teachers
Mar 2016 All India Rank-6 in IEEE Programming League, among over 1200 undergraduate students
Mar 2016 2nd Rank in CodeWars, a competitive-programming event hosted by IEEE, PEC on CodeChef
Nov 2016 - Mar 2018 Research Scholarship of 10k per month for Personal Emotional Doctor - Bot
May 2014 All India Rank-885 in JEE-Mains, among 1.4 million candidates
Aug 2014 1st Rank-Opener, PEC for best JEE-Mains rank among 600 students of the session 2014-2018
Dec 2014 1 Lakh Scholarship from CBSE for 96.4% marks in 12th Boards and 10 CGPA in 10th
Dec 2014 Letter of Appreciation from HRD Ministry, Govt. of India for 96.4% in CBSE-12th exams
June 2011 Catch Them Young: among the top 40 students selected from the Tricity by Infosys for a 2-week Programming-Basics training on their campus

Skills and Courses

Mathematics

Discrete Structures for Computer Science, Vector Calculus, Fourier Series and Laplace Transform, Operations Research, Bayesian Statistics

Computer Science

Introduction to Graduate Algorithms, Data Structures and Algorithms, Computer Architecture and Organization, OOP, Microprocessor, DBMS, Operating Systems, Computer Networks, Theory of Computation, Artificial Intelligence, Computer Graphics, Mobile Computing, Machine Learning, Reinforcement Learning, Deep Learning, Computer Vision, Big Data for Healthcare, Knowledge-Based AI

Programming & Web

C, C++, Python, JavaScript, TypeScript, ECMAScript 6, AngularJS, ReactJS, Angular 4, Webpack, Django with Python

Frameworks

PyTorch, pandas, NumPy, scikit-learn, SciPy, fastai, Transformers library


Last updated on 2021-10-03