V. Methods

Text feature extraction is the process of extracting a list of words from text data and transforming it into a feature set that is usable by a classifier. This work focuses on reviewing the available feature extraction methods. The following techniques can be used for extracting features from text data.

A. Bag of Words

The bag of words is the most common and the simplest of all the feature extraction methods; it forms a word-presence feature set from all the words of an instance. It is known as a "bag" of words because the method does not care about how many times a word occurs or about the order of the words; all that matters is whether the word is present in a list of words. The resulting features can be used for modelling with machine learning algorithms. The method is simple and flexible and can be used for extracting features from text data in various ways. A bag of words is a representation of text data that specifies the frequency of words in a document. It involves:
1. A lexicon of known words.
2. The frequency of occurrence of those known words.
The complexity of the bag of words model lies both in deciding how to score the presence of the known words and in designing the vocabulary of known words.

B. TF-IDF

A problem with the bag of words approach is that words with higher frequency become dominant in the data, even though they may not provide much information to the model; as a consequence, domain-specific words that do not have large scores may be discarded or ignored. To resolve this problem, the frequency of each word is rescaled by how frequently the word occurs across all the documents, so that the scores of words that are frequent within a document but also frequent across all documents are reduced. This way of scoring is known as Term Frequency - Inverse Document Frequency (TF-IDF).
• Term Frequency (TF) is the frequency of the word in the current document.
• Inverse Document Frequency (IDF) scores how common the word is across all the documents.
These scores can highlight the distinctive words, that is, the words that carry useful information in a given document. Accordingly, the IDF of an infrequent term is high, and the IDF of a frequent term is low.

C. Word2Vec

Word2Vec is used to construct word embeddings. The models created with word2vec are shallow, two-layer neural networks that, once trained, reproduce the semantic contexts of words. The model takes a large corpus of text as input and creates a vector space, usually of several hundred dimensions, in which each distinct word in the corpus is assigned a corresponding vector. Words with common contexts are placed in close proximity in the vector space. Word2vec can use one of two architectures: continuous skip-gram or continuous bag of words (CBOW). In continuous skip-gram, the current word is used to predict the surrounding window of context words, with nearby context words weighted more heavily than more distant ones. In the continuous bag of words architecture, the order of the context words does not influence the prediction, as it is based on the bag of words model. A minimal sketch of these three extraction techniques is given below.
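The following sketch (not part of the original experiments) illustrates the three techniques using the scikit-learn and gensim libraries; the toy corpus and the parameter values (vector_size=100, window=5, min_count=1) are assumed purely for illustration.

# Sketch of the three feature-extraction techniques, assuming scikit-learn
# and gensim are installed; the corpus below is a toy example.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models import Word2Vec

corpus = ["the movie was good",
          "the movie was bad",
          "a good plot and good acting"]

# A. Bag of words: a vocabulary of known words plus per-document counts.
bow = CountVectorizer()
X_bow = bow.fit_transform(corpus)          # document-term count matrix
print(bow.get_feature_names_out())
print(X_bow.toarray())

# B. TF-IDF: rescales the counts so that words frequent across all
# documents receive lower scores than distinctive ones.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)
print(X_tfidf.toarray())

# C. Word2Vec: trains a shallow two-layer network and assigns each
# distinct word a dense vector; sg=0 selects CBOW, sg=1 skip-gram.
tokens = [doc.split() for doc in corpus]
w2v = Word2Vec(sentences=tokens, vector_size=100, window=5, min_count=1, sg=0)
print(w2v.wv["good"][:5])                  # first components of one embedding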
VI. Experimentation Results

In this work, for experimentation, text features are extracted from the dataset using the bag of words, TF-IDF, and Word2vec techniques. The extracted text features are then evaluated for text classification. The effectiveness of the transformed free text is evaluated using logistic regression and a random forest classifier with 3-fold stratified cross-validation; a sketch of this evaluation protocol is given at the end of the section.

A. Logistic Regression

Logistic regression (LR) is used for classification problems: it predicts the group to which the object under consideration belongs, where classification means partitioning the data into groups based on particular features. A commonly used example of LR: suppose a tumor needs to be classified as benign or malignant based on features such as the location of the tumor, its size, etc. Logistic regression is named for the logistic function, which is also called the sigmoid function. In LR, Y is the dependent variable, which takes G (usually 2) distinct values and is regressed on a set of p independent variables X1, X2, ..., Xp. Y might be, for example, a condition after surgery, the absence or presence of a disease, or marital status. Because the names of these divisions are arbitrary, they are referred to by successive numbers; that is, Y takes the values 1, 2, ..., G. Let

pg = Pr(Y = g | X1, X2, ..., Xp).

The LR model is given by

ln(pg / p1) = ln(Pg / P1) + βg0 + βg1 X1 + ... + βgp Xp,   g = 1, 2, ..., G,

where pg is the probability that an individual with values X1, X2, ..., Xp is in outcome g, that is, pg = Pr(Y = g | X). The P1, P2, ..., PG are the prior probabilities of the outcomes. If the prior probabilities are equal, the term ln(Pg / P1) becomes zero; if they are not equal, they change the values of the intercepts in the LR equation. These equations are linear in the logits of p; in terms of the probabilities themselves, however, they are nonlinear. The nonlinear equations are

pg = e^zg / (e^z1 + e^z2 + ... + e^zG),   where zg = ln(Pg / P1) + βg0 + βg1 X1 + ... + βgp Xp

and the coefficients of the reference outcome are set to zero (β1j = 0), so that z1 = 0. These models are called logistic regression.

B. Random Forest Classifier

Random decision forests, or random forests (RF), are an ensemble learning technique used for classification. The algorithm creates a multitude of decision trees during training and outputs the class that is the mode of the classes predicted by the individual trees. The decision tree algorithm sometimes overfits the training dataset; in such cases RF is used as a correction. It is a supervised classification algorithm: the RF classifier constructs a set of decision trees from randomly selected subsets of the training set, and the votes from the different decision trees are aggregated to decide the final class of the test object. The algorithm thus creates a forest with a number of trees; generally, the more trees in the forest, the more robust the prediction.
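The evaluation protocol described above can be sketched as follows. This is not the authors' code; the toy texts, labels, and hyper-parameters such as n_estimators=100 are assumed purely for illustration.

# Sketch of the evaluation: logistic regression and a random forest
# scored with 3-fold stratified cross-validation on TF-IDF features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Toy data standing in for the paper's dataset.
texts = ["great product works well", "excellent and reliable",
         "very satisfied with it", "terrible quality broke fast",
         "awful and disappointing", "would not buy this again"]
labels = [1, 1, 1, 0, 0, 0]

X = TfidfVectorizer().fit_transform(texts)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

for name, clf in [("logistic regression", LogisticRegression(max_iter=1000)),
                  ("random forest", RandomForestClassifier(n_estimators=100,
                                                           random_state=0))]:
    scores = cross_val_score(clf, X, labels, cv=cv)   # accuracy per fold
    print(name, scores.mean())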