Classification and Interchange of Informal and Formal English Text
The purpose of text categorization [1] is to classify text documents into different predefined classes. There is little published research on the classification of formal and informal (English) text, and a systematic study of text formality classification is still missing from the literature. The most relevant work on formal and informal text classification is that of Fadi Abu Sheikha and Diana Inkpen [2], who classify formal and informal documents based on features extracted from their data sets. Some shortcomings of their research are discussed in the following paragraph. Based on our analysis and critical review of this research, we extend the binary classification model from the aforementioned work to a multi-class classification model. Instead of predicting a document as either formal or informal, we also work on a model that can predict the formality of a document on a scale from one to three, with degree one being very informal, degree two semi-formal/informal, and degree three very formal. Finally, we discuss which key factors contribute to the formality or informality of a text and provide some ideas and solutions for changing the text style. Our objective is a sustainable and replicable model with high accuracy and the flexibility to be extended in future research.
The prior research [2] does not disclose the exact algorithm used, and the features are extracted with the Connexor parser [3], which is not convenient to use locally. The data sets used in that research are suboptimal and inconsistent. For example, even though emails are usually informal in everyday life, many emails in the Enron data set are quite formal, as one would expect from a large corporation. Some of the data sets they used include annotations that are difficult to cleanse manually. The remaining data sets are adequate for the classification task, and we also use some of them in our heterogeneous dataset.
This research improves on the prior study and provides an implementation of the text classification models to assist Yoast in their future development in this field. Yoast [4] provides an SEO (search engine optimization) plugin for WordPress and other similar platforms, which is used across millions of websites. The function of the SEO plugin is to help users achieve a high position in search engine results, which includes offering feedback and suggestions for improving the content of their websites. In this paper, we focus on the formality side of writing, since the style of writing can influence the readers’ attitude towards the content. For example, when reading a fun story, one would expect the creator to make use of abbreviations, slang, colloquial language, etc. In a newspaper, on the other hand, readers expect more formal and serious text that builds trust in the content. Since text formality contributes greatly to how content is perceived by its readers, Yoast would like to turn formal/informal classification into a tool that assists the process of content creation. This thesis starts with a replication of the text formality classification method from the research by Fadi Abu Sheikha and Diana Inkpen [2]. Afterwards, we extend the research to multi-class classification on heterogeneous datasets. The models are structured to allow for modifications and follow-up research in the future.
Following [2], we extract features from each document and quantify them as numerical values. After the desired features for each document are extracted, we train and verify the model with stratified k-fold cross-validation. Afterwards, we apply a recursive feature selection method to examine which (subset of) features are essential for distinguishing formal and informal texts.
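The sketch below illustrates this evaluation pipeline in Python with scikit-learn. It is a minimal example, not our exact implementation: it assumes the stylistic features have already been extracted into a numeric matrix with one row per document, the file names features.npy and labels.npy are hypothetical placeholders, and the random forest is used only as an example estimator.

```python
# Minimal sketch of the evaluation pipeline described above, assuming the
# stylistic features are already stored as a numeric matrix X (one row per
# document) with labels y (0 = informal, 1 = formal, or 1-3 for the
# multi-class setting).
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

X = np.load("features.npy")   # hypothetical path to the extracted features
y = np.load("labels.npy")     # hypothetical path to the document labels

# Stratified k-fold keeps the class proportions identical in every fold,
# which is useful when the classes are not perfectly balanced.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

clf = RandomForestClassifier(n_estimators=200, random_state=42)
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
print(f"mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# Recursive feature elimination with cross-validation reports which subset
# of features is essential for separating formal and informal documents.
selector = RFECV(RandomForestClassifier(n_estimators=200, random_state=42), cv=cv)
selector.fit(X, y)
print("selected feature mask:", selector.support_)
```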
In the discussion section we compare several kinds of classifiers (decision tree, logistic regression, and random forest) in the experiment, analyze and discuss the classifiers based on two different datasets, and implement the mock-up classification algorithms. The choices made for the classifiers are limited by the JavaScript implementation. For the decision tree model, we only need to distinguish different documents according to the feature splits suggested by the model; in other words, the decision tree result can be utilized to create a rule-based algorithm. For logistic regression, we can apply the weights and bias to the features and obtain an approximate result to determine the type of document, as sketched below. In the end, we also discuss what makes documents formal or informal and how to interchange the two styles.
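As a minimal illustration (not the exact plugin code), the following Python sketch shows how the weights and bias of a trained logistic regression model can be applied directly to a feature vector; this reduced form is what ports naturally to a rule-based JavaScript implementation. The feature files are the same hypothetical placeholders as in the previous sketch.

```python
# Illustrative sketch of reducing a trained logistic regression model to its
# weights and bias so the same decision can be reproduced outside Python.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.load("features.npy")   # hypothetical feature matrix
y = np.load("labels.npy")     # hypothetical labels (binary: 0 = informal, 1 = formal)

model = LogisticRegression(max_iter=1000).fit(X, y)
weights, bias = model.coef_[0], model.intercept_[0]

def classify(features: np.ndarray) -> int:
    """Apply the exported weights and bias to one feature vector."""
    score = float(np.dot(weights, features) + bias)
    probability = 1.0 / (1.0 + np.exp(-score))   # sigmoid
    return 1 if probability >= 0.5 else 0        # 1 = formal, 0 = informal

# Sanity check: the manual rule should match the library prediction.
assert classify(X[0]) == int(model.predict(X[:1])[0])
```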
The GitHub repository for this project can be found here.
[1] V. K. Vijayan, K. R. Bindu, and L. Parameswaran, “A comprehensive study of text classification algorithms,” in 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI), 2017, pp. 1109–1113.
[2] F. Abu Sheikha and D. Inkpen, “Automatic classification of documents by formality,” in Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering (NLPKE-2010), 2010, pp. 1–5.
[3] P. B. Fuß, A. Kurczewska, S. Dyka, R. Mitkov, I. I. Siitonen, K. K. Ruokolainen, A. H. Tiedemann, and J. X. Huang. [Online]. Available: http://www.connexor.com/
[4] T. Yoast, “Yoast SEO,” Dec. 2021. [Online]. Available: https://wordpress.org/plugins/wordpress-seo/