The dataset Loan Prediction: Machine Learning is indispensable for beginners in Data Science. It lets you work on supervised learning, more precisely a classification problem. This is why I would like to introduce you to an analysis of it. I have explored the dataset and found a lot of interesting facts about loan prediction. Wherever there is 'Data', there is a lot of interest for 'Data Scientists'. We have historical data on some loans. The first part focuses on data analysis and data visualization. In the second part we will look at the algorithm used to tackle our problem.

The purpose of this analysis is to predict loan eligibility.

Here I have provided a data set.
To proceed further, we need to download the Test & Train data set.

Test and train dataset.zip

# Importing Library
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Reading the training dataset in a dataframe using Pandas
# Reading the test dataset in a dataframe using Pandas
# Store total number of observation in training dataset
# Store total number of columns in testing data set
test_col = len(test.columns)  # `test` is an assumed dataframe name; the original name was lost

Understanding the various features (columns) of the dataset:

# Summary of numerical variables for training data set
df.describe()

For the non-numerical values (e.g. Property_Area, Credit_History, etc.), we can look at the frequency distribution to understand whether they make sense or not.

# Get the unique values and their frequency of variable Property_Area
df['Property_Area'].value_counts()

Understanding the Distribution of Numerical Variables

# Box Plot for understanding the distributions and to observe the outliers.
# Box Plot for variable ApplicantIncome of training data set
df.boxplot(column='ApplicantIncome')

The above Box Plot confirms the presence of a lot of outliers/extreme values. This can be attributed to the income disparity in society.

# Box Plot for variable ApplicantIncome by variable Education of training data set
df.boxplot(column='ApplicantIncome', by='Education')

We can see that there is no substantial difference between the mean income of graduates and non-graduates. But graduates with very high incomes appear to be the outliers.

# Histogram of variable LoanAmount
df['LoanAmount'].hist(bins=50)

# Box Plot for variable LoanAmount of training data set
df.boxplot(column='LoanAmount')

# Box Plot for variable LoanAmount by variable Gender of training data set
df.boxplot(column='LoanAmount', by='Gender')

LoanAmount has missing as well as extreme values, while ApplicantIncome has a few extreme values.

Understanding Distribution of Categorical Variables:

# Loan approval rates in absolute numbers
loan_approval = df['Loan_Status'].value_counts()['Y']
print(loan_approval)

Output:
422 loans were approved.

# Credit History and Loan Status
pd.crosstab(df['Credit_History'], df['Loan_Status'], margins=True)

# Function to output percentage row wise in a cross table
def percentageConvert(ser):
    # The last entry of each row is the row total added by margins=True
    return ser / float(ser.iloc[-1])

# Loan approval rate for customers having Credit_History (1)
credit_crosstab = pd.crosstab(df['Credit_History'], df['Loan_Status'], margins=True).apply(percentageConvert, axis=1)
loan_approval_with_Credit_1 = credit_crosstab['Y'][1]
print(loan_approval_with_Credit_1*100)

Output:
79.58 % of the applicants with Credit_History equal to 1 had their loans approved.

Outliers of LoanAmount and Applicant Income:

# Add both ApplicantIncome and CoapplicantIncome to TotalIncome
df['TotalIncome'] = df['ApplicantIncome'] + df['CoapplicantIncome']

# Looking at the distribution of TotalIncome

# Replace missing value of Self_Employed with more frequent category
df['Self_Employed'] = df['Self_Employed'].fillna('No')

The extreme values are practically possible, i.e. some people might apply for high value loans due to specific needs. So instead of treating them as outliers, let's try a log transformation to nullify their effect:

# Perform log transformation of TotalIncome to make it closer to normal
df['TotalIncome_log'] = np.log(df['TotalIncome'])

Data Preparation for Model Building:

# Looking at the distribution of TotalIncome_log
df['TotalIncome_log'].hist(bins=20)
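To make the preparation steps in this post easier to try out, here is a minimal, self-contained sketch. It assumes a tiny made-up dataframe in place of the real training data (the values are purely illustrative). It also uses the `normalize="index"` option of `pd.crosstab` as a stand-in for the row-wise percentage helper, so it is not necessarily how the original analysis was coded. The names `approval_by_credit` and `loan_approval_with_credit_1` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the training data (values are made up for illustration).
df = pd.DataFrame({
    "ApplicantIncome":   [5000, 3000, 81000, 2500, 4000, 6000],
    "CoapplicantIncome": [0, 1500, 0, 2000, 0, 1000],
    "Self_Employed":     ["No", None, "Yes", "No", None, "No"],
    "Credit_History":    [1.0, 1.0, 0.0, 1.0, 0.0, 1.0],
    "Loan_Status":       ["Y", "Y", "N", "Y", "N", "N"],
})

# Replace missing Self_Employed values with the more frequent category.
df["Self_Employed"] = df["Self_Employed"].fillna("No")

# Combine both incomes, then log-transform to dampen the extreme values.
df["TotalIncome"] = df["ApplicantIncome"] + df["CoapplicantIncome"]
df["TotalIncome_log"] = np.log(df["TotalIncome"])

# Loan approvals in absolute numbers.
loan_approval = df["Loan_Status"].value_counts()["Y"]

# Share of each Loan_Status within every Credit_History value (rows sum to 1).
approval_by_credit = pd.crosstab(
    df["Credit_History"], df["Loan_Status"], normalize="index"
)
loan_approval_with_credit_1 = approval_by_credit.loc[1.0, "Y"]

print(loan_approval)
print(round(loan_approval_with_credit_1 * 100, 2))
```

With this made-up data, three loans are approved and the approval rate among applicants with Credit_History equal to 1 is 75 %. On the real dataset the same steps would be applied to the dataframe read from the downloaded training file.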