Learning Machines: Natural Language Processing Explained with Diyi Yang


Welcome to Learning Machines, where we’ll talk with faculty members from the Machine Learning Center at Georgia Tech (ML@GT) about their main research area and the future of their work.

Today we talked with Diyi Yang, an assistant professor in ML@GT and the School of Interactive Computing.

Yang’s lab, Social and Language Technologies (SALT), combines natural language processing (NLP) and computational social science to bring NLP to the next level.

We asked Yang to explain NLP and give students advice on how to write a great research paper.

For those who are unfamiliar with natural language processing, can you explain what it is?

NLP is a field that studies how to enable computers to process, analyze and understand human natural language. It can be viewed as an interdisciplinary field of linguistics, computer science, and artificial intelligence.

Natural language is how we communicate with each other. Almost every aspect of our lives is recorded in a text format, from messages and emails, to newspapers and menus. With such a huge amount of text and data, we need to understand what it means and how to develop methods to process and analyze this data.

You work primarily in computational social science and natural language processing (NLP). What drew you to these two areas of expertise?

Over the last few decades, NLP has improved dramatically and produced industrial applications like machine translation and personal assistants. Despite enabling these applications, current NLP systems and research largely ignore the social part of language.

As a result, systems still cannot communicate naturally with humans and are limited in their functionality and growth. This results in biases towards specific populations in toxicity detection models and failure in machine learning systems to generate culturally polite and specific outputs.

My research goal is to study both the content dimension of natural language and its social dimension (e.g., who says it, in what context, for what goals). I aim to build methods that can understand and support human-human and human-machine communication.

What kinds of problems does natural language processing help solve?

NLP mainly focuses on how to process, analyze, and understand natural language data. Common NLP tasks include document classification, text generation, summarization, conversational agents, and information extraction.

We interact with NLP frequently in our daily lives. For instance, you might use the spam filtering function in your email; talk to Siri, Alexa, or Google Assistant; use translation tools to translate one language to another; search keywords via an Internet search engine; or use the autocomplete function for text messages.

What are some common challenges or problems you run into when working on an NLP project?

In my opinion, one of the first challenges is the unobserved or undefined dimension of natural language phenomena. For example, we can understand the subtle insights and signals in human communication, such as humor, implication, and common sense, but our NLP models cannot. We know what a message means, but our models cannot clearly express the meaning of a sentence or understand its context. This raises many challenges for researchers: how to formulate these subtle aspects of language systematically, and how to model them computationally.

The second challenge is data variation and sparsity. Text data exists in so many languages and domains (e.g., Twitter, Reddit, newspapers, Wikipedia). On the one hand, we have a huge amount of data. On the other hand, labeled data that our models can learn from is limited, especially for socially related tasks such as predicting whether a given text message contains emotional support.

Can you tell us about a project you are either currently working on or one that you are particularly proud of?

One major focus in our lab is to develop better machine learning models for low-resource NLP tasks. Deep learning has achieved extremely good performance in most supervised learning settings. However, when there is only limited labeled data, these models often fail. This strong dependence on labeled data largely prevents neural network models from being applied to new settings or real-world situations.

To reduce the dependence of supervised models on labeled data, we are developing novel semi-supervised learning architectures. We are leveraging advances in data augmentation, adversarial training, and self-training so that models can use both labeled and unlabeled data.
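For readers who want a concrete picture of self-training, here is a minimal sketch. It is not the lab's architecture: the word-overlap classifier, the function names (`featurize`, `train`, `predict`, `self_train`), the confidence threshold, and the example messages are all assumptions made for illustration. It only shows the general loop of training on labeled text, labeling the unlabeled text the model is confident about, and retraining.

```python
from collections import Counter


def featurize(text):
    """Bag-of-words counts for a lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())


def train(labeled):
    """Sum word counts per label to get one word profile per class."""
    profiles = {}
    for text, label in labeled:
        profiles.setdefault(label, Counter()).update(featurize(text))
    return profiles


def predict(profiles, text):
    """Return (label, confidence), or (None, 0.0) if no word is known.

    Confidence is the winning label's share of the total word overlap.
    """
    words = featurize(text)
    scores = {
        label: sum(min(count, profile[word]) for word, count in words.items())
        for label, profile in profiles.items()
    }
    total = sum(scores.values())
    if total == 0:
        return None, 0.0
    best = max(scores, key=scores.get)
    return best, scores[best] / total


def self_train(labeled, unlabeled, threshold=0.8, max_rounds=5):
    """Repeatedly add confidently pseudo-labeled texts to the training set."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        profiles = train(labeled)
        confident, remaining = [], []
        for text in pool:
            label, confidence = predict(profiles, text)
            if label is not None and confidence >= threshold:
                confident.append((text, label))
            else:
                remaining.append(text)
        if not confident:
            break  # nothing new to learn from
        labeled.extend(confident)
        pool = remaining
    return train(labeled), labeled


# Hypothetical data: two labeled messages and three unlabeled ones.
seed = [
    ("i am here for you", "support"),
    ("the meeting is at noon", "other"),
]
unlabeled = [
    "here for you always",
    "meeting moved to noon",
    "you got this i am here",
]

model, grown = self_train(seed, unlabeled)
print(predict(train(seed), "always"))  # seed-only model has never seen "always"
print(predict(model, "always"))        # self-trained model picked it up
```

In this toy setup, the word "always" never appears in the labeled messages, so the seed-only model cannot classify it; after self-training, the model has absorbed it from the unlabeled messages. Real systems replace the overlap classifier with a neural model and need safeguards against reinforcing their own mistakes.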

Another direction that I am interested in is studying bias and fairness in language use and their impact on algorithms, people, and societies. Our lab has developed natural language models to detect personally experienced racial discrimination online and annoying behaviors that people dislike on Twitter, in order to better support people who have been negatively affected.

Our recent work looks at bias in text from the perspective of inappropriate subjectivity. This kind of bias erodes trust in media and fuels social conflict. Our work can identify, categorize, and reduce bias in real-world news, books, and political speeches according to human judgement, and it does so better than state-of-the-art style transfer and machine translation systems. You can check out this podcast for more details.

You recently received best paper honorable mentions at SIGCHI and were named “Best Reviewer” at ICWSM. To you, what makes a paper great?

In my personal opinion, great work needs to focus on a research problem that matters, but may take different forms across different communities and contexts. It does not always need to be a fancy algorithm, a complicated model architecture, or a good result number.

There is also a slight difference between “great work” and “great paper”. Once the “great work” is done, one needs to organize it into a “great research paper”. This requires additional skills to describe your work to others so that the knowledge can be shared with a broader research field.

You primarily work with Ph.D. students, but are open to working with master’s and undergraduate students. Why do you think it’s important for faculty to work with non-doctorate students on research?

Generally, I would like to get more students interested in computer science and NLP, especially women and students from other underrepresented groups. Some students might not know what the research process looks like or how it feels to work as a “scientist.” I think that getting non-doctorate students involved in research projects could at least help them figure out whether doing a PhD or pursuing a career in STEM might be a good fit for them.

Personally, I enjoy working with undergraduate and master’s students who are passionate about doing research at the intersection of NLP and computational social science. I’d like to get more students interested in solving big problems that matter and in using technology to make the world a better place.