Missing values can appear as ‘NaN’ (Not a Number), ‘NA’ (Not Available), ‘n/a’, ‘na’, ‘?’, a blank space, an out-of-range value and in many other forms depending on the user(s) filling in the data. In real datasets, missing values are almost unavoidable and they can be caused by several reasons like corrupted data or unrecorded observations.
Learning to handle missing values in a dataset is very important because most machine learning models cannot handle missing values. In this tutorial, you’ll learn how to handle missing values using Pandas. Let’s get started!
Identifying Missing Values
I joined the third cohort of the She Code Africa Data Science Mentorship Program about three months ago, and it has been a ride! I’d like to share with you, Reader some of the many things that I learnt while on the program. I hope that you find this article both interesting and educative. Now, let’s get started. Oh, no, wait!
I appreciate sincerely, my Mentor; Miss Olufunmilayo Ruth, Aforijiku. Yeah, we get assigned a mentor to guide us through, and mine was great! She taught me the tools I am about sharing with you. God bless her kind heart.
This work was carried out as a technical project, being a member of the third cohort of the She Code Africa Data Science Mentorship program.
The dataset used is the Iris flower dataset, gotten from Kaggle. In this project, I explored and classified the species of Iris flower based on the sepal length, sepal width petal length and petal width. This work was carried out in three parts: Exploratory Data Analysis, Predictive Modelling and Data Web Application Building. Let’s go!
Exploratory Data Analysis
The data has 150 rows and 6 columns. The columns are Id, sepal length, sepal width, petal…
This project was carried out as an assignment, being a member of the third cohort of the She Code Africa Data Science Mentorship Program.
Here, I explore the rate of unemployment in the USA over every month for 27 years; 1990–2016, in 47 states of the USA consisting of 1,752 counties. This analysis is done in two parts namely: Analysis of Time(Month and Year) and Analysis of Place/Location (State and County).
The data used for this analysis was gotten from the Bureau of Labor Statistics, US Department of Labor. It has 885,548 rows and 5 columns.
The Password generator Python program generates a random password for the user. The password generated would be a mixture of upper and lower case letters, symbols and numbers. The program gets information from the user on how long they want their password to be and how many letters and numbers they want in their password which must have atleast 6 characters. In this article, I have explained the Password generator Python program.
We start by importing the needed libraries and submodules:
import numpy as np
from numpy.random import choice,shuffle
import pandas as pd
From the string…
Python is the most commonly used and one of the best programming languages for data science. This is because Python is an easy to use language and it has many open-source libraries that can be used for data science. Since Python is open-source, it is free and has an active community which makes it regularly updated. Here, I have discussed some of the top Python libraries for data science. I have divided these libraries into two: those used for data processing and modelling, and those used for data visualizations. Let’s go!
Libraries Used for Data Processing and Modelling
Physicist and Data Scientist