Pandas Basic Tutorial

DEMO

Note: These notes are based on the Stack Overflow Developer Survey dataset.

import pandas as pd


df = pd.read_csv("file path") # reads the CSV file

df  #prints the entire data frame.

df.shape #gives the shape of the data frame (rows, columns)

df.info() #gives complete info about the data frame (columns, dtypes, non-null counts)


pd.set_option("display.max_columns", num_columns) # displays up to that many columns

pd.set_option("display.max_rows", num_rows) # displays up to that many rows


schema_df = pd.read_csv("file path_schema.csv") #describes what each column means

df.head() #displays first 5 rows by default 

df.head(10) # displays first 10 rows

df.tail() # displays last 5 rows 

df.tail(10) #displays last 10 rows


DATAFRAME AND SERIES


  • We can create a data frame with the help of dictionaries.


people = {"first": ["Corey", "Jane", "John"],
          "last": ["Schafer", "Doe", "Doe"],
          "email": ["Corey@gmail", "Jane@gmail", "John@gmail"]}


  • Keys become the column names; values become the column data.

  • There are two core data structures in pandas: Series and DataFrame.

  • DataFrame: a 2-dimensional data structure; in layman's terms, data with rows and columns.


df = pd.DataFrame(people) #Creates data frame from the dictionary


  • We can access the columns in two ways 


df['email'] #prints a particular column of the data frame, in this case the email column.

type(df['email']) #prints the data type of the email column


  • It prints "pandas.core.series.Series" because the column is a Series.

  • Series: a 1-dimensional array; in layman's terms, a list of data, but with a lot more functionality than that.
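As a self-contained sketch (the `people` dictionary here mirrors the one above, with made-up emails):

```python
import pandas as pd

people = {"first": ["Corey", "Jane", "John"],
          "last": ["Schafer", "Doe", "Doe"],
          "email": ["Corey@gmail", "Jane@gmail", "John@gmail"]}
df = pd.DataFrame(people)

# A single column is a 1-D Series; the whole table is a 2-D DataFrame
print(type(df))           # <class 'pandas.core.frame.DataFrame'>
print(type(df["email"]))  # <class 'pandas.core.series.Series'>
```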


df.email # prints the particular column of the data frame, in this case the email column (dot notation)

df[['last', 'email']] #accesses multiple columns of the data frame, in this case last and email; returns a DataFrame

df.columns #prints all the column names as an Index


  • We can use loc and iloc to access the rows


df.iloc[0] #prints the 1st row of the data frame.

df.iloc[[0, 1]] #prints the 1st and 2nd rows of the data frame.


  • We can also access columns using iloc: the first argument selects rows and the second selects columns.

  • iloc takes only integer positions, like rows 0, 1 or columns 0, 1 (by position). We cannot use column names here.


df.iloc[[0, 1], 2] #prints the column at position 2 (email) for the 1st and 2nd rows.


  • loc takes labels: row index labels as well as column names.


df.loc[[0, 1], 'email'] #prints the 'email' column of the 1st and 2nd rows of the data frame.

df.loc[[0, 1], ['email', 'last']] #prints the 'email' and 'last' columns of the 1st and 2nd rows of the data frame.


  • The columns are returned in the order we list them.


df.shape #prints the shape of the data frame

df['Hobbyist'].value_counts() # counts how many respondents answered Yes vs No

df.loc[0:2, 'Hobbyist':'Experiment'] #prints rows 0 through 2 and the columns from Hobbyist through Experiment


  • With loc, slice end points are inclusive: the Experiment column and row 2 are both included (unlike standard Python slicing, where the end is excluded).
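This inclusive behaviour of loc slices is easy to verify on a small toy frame (the column names below are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"A": [1, 2, 3, 4],
                    "B": [5, 6, 7, 8],
                    "C": [9, 10, 11, 12]})

# loc slices by LABEL and includes the end label on both axes...
by_label = toy.loc[0:2, "A":"B"]   # rows 0, 1, 2 and columns A through B

# ...while iloc slices by POSITION and excludes the end, like regular Python
by_position = toy.iloc[0:2]        # rows at positions 0 and 1 only

print(by_label.shape)     # (3, 2)
print(by_position.shape)  # (2, 3)
```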


INDEXES


  • We can set the index to whatever we want instead of the default indexes 0, 1, 2, ...


df.set_index('email') #returns a copy with the email column as the index


  • The above line does not affect the original data frame; it still has the default indexes 0, 1, 2, ...


df.set_index('email', inplace=True) #sets the email column as the index in the data frame itself


  • The above line will change the indexes of the original data set into emails.

df.index #shows all the index labels


  • Why change the indexes? Well, a unique data index will be a good identifier.


Now 

df.loc['Corey@gmail'] # prints the row that belongs to Corey


df.reset_index(inplace=True) # restores the default indexes 0, 1, 2, ...


  • When the data itself has a unique id, we don't need the default indexes given by the data frame.

  • There is another way to set the desired index column:


df = pd.read_csv('file path', index_col='column name') # sets that column as the index while reading


schema_df.sort_index() #sorts the data by index

schema_df.sort_index(ascending=True) #sorts the data by index in ascending order (the default)

schema_df.sort_index(inplace=True) #sorts the data by index in the data frame itself


FILTERING


df['last'] == 'Doe' #prints True/False for each row depending on whether 'last' equals Doe


  • If 'last' is Doe it prints True, otherwise False. This Boolean Series is a filter mask.


df[df['last'] == 'Doe'] # prints the rows whose last name is Doe

df.loc[df['last'] == 'Doe', 'email'] # prints the emails of the rows whose last name is Doe


df[(df['last'] == 'Doe') & (df['first'] == 'John')]  #prints the rows with last name Doe AND first name John

df[(df['last'] == 'Doe') | (df['first'] == 'John')]  #prints the rows with last name Doe OR first name John

  • Each condition needs its own parentheses because & and | bind more tightly than ==.


df.loc[df['ConvertedComp'] > 70000] #prints the rows with salaries greater than 70,000

df.loc[df['ConvertedComp'] > 70000, ['Country', 'LanguageWorkedWith']] # prints country and languages of the people with salaries greater than 70,000


countries = ['United States', 'India', 'United Kingdom', 'Germany', 'Canada']

filt = df['Country'].isin(countries)

df.loc[filt, 'Country']

#prints the rows whose Country matches a country in the list


df['LanguageWorkedWith']

filt = df['LanguageWorkedWith'].str.contains('Python', na=False) #string method; na=False treats missing values as False

df.loc[filt, 'LanguageWorkedWith']

# prints the LanguageWorkedWith column wherever it mentions Python


filt  #prints the True/False values


UPDATING VALUES

  • Let’s look at how to update column names.

df = pd.DataFrame(people) #creates data frame of data

df.columns #shows column names


  • If we want to update the column name


df.columns = ['first_name', 'last_name', 'email'] #changes column names


  • If we have to change column names into uppercase or lowercase


df.columns = [x.upper() for x in df.columns] #changes all column names to uppercase

df.columns = [x.lower() for x in df.columns] #changes all column names to lowercase


  • If we want to replace spaces with underscores "_"

  • Dot "." notation for accessing columns doesn't work if the column name contains spaces


df.columns = df.columns.str.replace(' ', '_') #replaces all spaces with underscores
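A runnable sketch of the same replacement, with made-up column names containing spaces:

```python
import pandas as pd

df = pd.DataFrame({"first name": ["Corey"], "last name": ["Schafer"]})

# df.first name would be a syntax error, so dot notation is unusable here;
# replacing the spaces with underscores fixes that
df.columns = df.columns.str.replace(" ", "_")

print(df.columns.tolist())  # ['first_name', 'last_name']
print(df.first_name[0])     # dot notation works now: Corey
```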


  • If we want to rename particular columns


df.rename(columns={'first_name': 'first', 'last_name': 'last'}) #returns a copy with renamed columns (temporary)

df.rename(columns={'first_name': 'first', 'last_name': 'last'}, inplace=True) #renames the columns in the data frame as well

  • Let's look at how to update values
  • Let's say we want to change the last name of the person at index 2


df.loc[2] # prints the row at index 2.

df.loc[2] = ['John', 'Smith', 'johnsmith@gmail'] #replaces every value in that row

Or

df.loc[2, ['last', 'email']] = ['Smith', 'johnsmith@gmail'] #changes only the last name and email in that row


  • Pandas has the '.at' indexer; it is used to change a single value

  • We can also use .loc instead for single values

  • Let's say you have large data and want to find "John Doe" and change his last name; then we apply a filter and change the value through it


filt = (df['email'] == 'johndoe@gmail') #filters the rows matching this email; since emails are unique, this identifies one row

df.loc[filt, 'last'] = 'Smith' #changes the last name to Smith.
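The `.at` indexer mentioned above can be sketched on a tiny made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"first": ["John"], "last": ["Doe"], "email": ["johndoe@gmail"]})

# .at reads or writes a single scalar value by row label and column name
df.at[0, "last"] = "Smith"

print(df.at[0, "last"])  # Smith
```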

  • How do we change multiple row values? Let's say we want to change all the email addresses to lowercase. Let's go through all 4 methods.

  • 1) apply  2) applymap  3) map  4) replace


df['email'].str.lower() #returns the emails in lowercase (temporary)

df['email'] = df['email'].str.lower() #changes all emails to lowercase in the data frame as well


  • 1) apply 

apply is used to call a function on our values. Apply can work on a data frame or a series object.

Let's say we want to see the length of all email addresses.

df['email'].apply(len) #calculates the length of each email address.

  • let's say we want to change all the email addresses into upper case.

def upper_email(email):
    return email.upper()

df['email'].apply(upper_email) # returns all the emails in upper case; the data frame is unchanged

df['email'] = df['email'].apply(upper_email) #changes the email addresses in the data frame itself.

  • Let's say we want to apply a function to a data frame; let's see what happens if we apply the "len" function to the entire data frame.

df.apply(len) #applies the len function to each Series (column) of the data frame

# it shows the number of rows in each column

  • If we want to see the number of columns in each row

df.apply(len, axis="columns") #shows the number of columns in each row

  • On a data frame, apply runs the function once per Series (column or row), not on every single value. To operate on every value of the whole frame, we need something else.


  • 2) applymap

applymap performs an operation on every single value of the data frame.

It works only on data frames; it doesn't work on Series objects.

df.applymap(len) #prints the length of each and every value in the data frame.

  • Let's say we want to change every value in the data frame to lowercase; then we can use applymap

df.applymap(str.lower) #changes every value in the data frame to lowercase (works only if every value is a string)


  • 3) map

The map method works only on Series. It substitutes each value in the series with another value.

Let's say we want to change the first names of persons in the data frame.

df['first'].map({"Corey": "Chris", "John": "Jonny"}) # changes Corey to Chris and John to Jonny, and maps the remaining values to NaN


  • 4) replace

If we want to replace the names of persons without changing the other names to NaN:

df['first'].replace({"Corey": "Chris", "John": "Jonny"}) # changes Corey to Chris and John to Jonny; the remaining names stay the same (temporary)

df['first'] = df['first'].replace({"Corey": "Chris", "John": "Jonny"})  #changes the data frame itself

 

ADD/REMOVE COLUMNS/ROWS 


  • Let’s see how to update the columns

  • Let’s say we want to combine first and last name and create  a new column.


df['first'] + ' ' + df['last'] # concatenates the first and last names

df['full_name'] = df['first'] + ' ' + df['last'] # creates a new column named full_name


  • We cannot use dot "." notation while creating a column, because Python would think we are assigning an attribute to the data frame, not creating a column
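A quick sketch of that pitfall (pandas may emit a UserWarning on the attribute assignment, which is exactly the point):

```python
import pandas as pd

df = pd.DataFrame({"first": ["Corey"], "last": ["Schafer"]})

# Dot notation sets an attribute on the object, NOT a column...
df.full_name = df["first"] + " " + df["last"]
print("full_name" in df.columns)   # False

# ...bracket notation actually creates the column
df["full_name"] = df["first"] + " " + df["last"]
print("full_name" in df.columns)   # True
```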

  • Now let's look at removing columns. Since we have a full name, we no longer need the first and last name columns.

df.drop(columns=['first', 'last']) #returns a copy without those columns (temporary)

df.drop(columns=['first', 'last'], inplace=True) #removes the columns from the data frame as well


  • If we want to split the full name into first and last names.


df['full_name'].str.split(' ') #splits on the space delimiter; the 1st word is the first name

df[['first', 'last']] = df['full_name'].str.split(' ', expand=True) #creates the first and last columns from the split


  • Now, Let’s look at adding and removing rows.


df.append({'first': 'Tony'}, ignore_index=True) #appends a row with Tony in the first column; the remaining column values will be NaN


  • Let's see how to add two data frames 


df.append(df2, ignore_index=True) #combines the two data frames even with different numbers of rows and columns; the columns are sorted alphabetically by default.

df.append(df2, ignore_index=True, sort=False) # combines the two data frames without sorting the columns


df = df.append(df2, ignore_index=True, sort=False)  #assigns the combined result back to df

  • Note: append was removed in pandas 2.0; in newer versions use pd.concat([df, df2], ignore_index=True) instead.


  • Let's look at removing rows 


df.drop(index=4) #drops the 5th row (index =4)


  • If we want to delete the multiple rows with same last name


df.drop(index=df[df['last'] == 'Doe'].index) # removes all the rows with last name Doe

Or

filt = df['last'] == 'Doe'

df.drop(index=df[filt].index) # removes all the rows with last name Doe


SORTING DATA


  • Let's sort the data by last name, in ascending or descending order


df.sort_values(by='last') # sorts the data by last name

df.sort_values(by='last', ascending=False)  #sorts by last name in descending order


  • If we want to sort by both first and last names


df.sort_values(by=['last', 'first'], ascending=False) # sorts the data in descending order by both last and first names (temporary)


  • If we want to sort the last name by descending order and first name is in ascending order.


df.sort_values(by=['last', 'first'], ascending=[False, True], inplace=True) # sorts last name descending and first name ascending, in the data frame itself


  • If you want to restore the original order


df.sort_index() # sorts the data back into index order

df['last'].sort_values() #sorts just the last-name values.


  • Let’s say you want to see the top 10 largest salaries.


df['ConvertedComp'].nlargest(10) # prints the top 10 highest salaries in the data

df.nlargest(10, 'ConvertedComp') #shows the entire rows of the top 10 highest salaries

df.nsmallest(10, 'ConvertedComp') #shows the entire rows of the 10 smallest salaries



AGGREGATING AND GROUPING


  • Aggregation basically means, combining a piece of data into a single result like mean, median etc.

  • What might be the salary for developers who answered this survey?

  • For this we can calculate the median.


df['ConvertedComp'].head(5)

df['ConvertedComp'].median() #calculates the median of that particular column.


df.median() # computes the median of every numeric column (newer pandas versions may need numeric_only=True)

df.describe() #shows all the aggregate values in one place

  • Mean is not a good metric to represent the data because it can be heavily affected by outliers. So Median gives somewhat reasonable outcomes.
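A tiny made-up salary series shows the effect:

```python
import pandas as pd

salaries = pd.Series([50_000, 55_000, 60_000, 65_000, 5_000_000])  # one huge outlier

print(salaries.mean())    # 1046000.0 -- dragged way up by the outlier
print(salaries.median())  # 60000.0   -- still representative of a typical value
```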


count counts the non-missing rows.


  • If you are working on a social media platform, you need to see which social media account was used most


df['SocialMedia'] # prints the social media column

df['SocialMedia'].value_counts() #shows the number of users of each social media site

df['SocialMedia'].value_counts(normalize=True) #shows the output as proportions


  • “Group by” is a combination of splitting the object, applying the function, combining the results
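A self-contained sketch of split-apply-combine on made-up data (the country and salary values are invented):

```python
import pandas as pd

df = pd.DataFrame({"Country": ["US", "US", "India", "India", "India"],
                   "Salary":  [90_000, 110_000, 30_000, 40_000, 50_000]})

# split the rows by Country, apply median to each group's Salary,
# and combine the results into one Series indexed by country
medians = df.groupby("Country")["Salary"].median()

print(medians.loc["US"])     # 100000.0
print(medians.loc["India"])  # 40000.0
```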


df['Country'].value_counts() #shows which countries most survey responses came from

country_group = df.groupby('Country') #groups the data by country names

country_group.get_group('United States') #shows the data that belongs to the United States


  • Most popular social media sites in the United States


filt = df['Country'] == 'United States'

df.loc[filt]['SocialMedia'].value_counts() #prints the most popular social media in the United States, with the number of users

country_group['SocialMedia'].value_counts() #prints the popular social media country-wise

country_group['SocialMedia'].value_counts().loc['United States'] #prints the most popular social media in the United States, with the number of users


  • If the filter does the same thing, why do we need "groupby"? It removes the need for a filter per country: instead of applying a filter over and over, "groupby" produces results for every group at once.


country_group['ConvertedComp'].median() #prints the median salary for every country

country_group['ConvertedComp'].median().loc['Germany'] #prints the median salary for Germany

country_group['ConvertedComp'].agg(['median', 'mean']) #prints the mean and median of the salaries for each country.

country_group['ConvertedComp'].agg(['median', 'mean']).loc['Canada'] #prints the mean and median salaries for Canada


  • How many people in a given country know how to use Python?


filt = df['Country'] == 'India'

df.loc[filt]['LanguageWorkedWith'] #prints the LanguageWorkedWith column for respondents from India

df.loc[filt]['LanguageWorkedWith'].str.contains('Python') #prints True or False depending on whether each person knows Python.


  • The sum function also works on booleans: it counts True as 1 and False as 0.

df.loc[filt]['LanguageWorkedWith'].str.contains('Python').sum() #gives the total number of people who know Python


pd.concat([series1, series2]) #concatenates two Series (series1 and series2 here are any two Series)
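A minimal sketch with two made-up Series:

```python
import pandas as pd

s1 = pd.Series(["a", "b"])
s2 = pd.Series(["c", "d"])

# stack the two Series end to end; ignore_index renumbers the result 0..3
combined = pd.concat([s1, s2], ignore_index=True)

print(combined.tolist())  # ['a', 'b', 'c', 'd']
```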


CLEANING DATA

Casting Data types and handling missing values


df.dropna() #drops the rows with missing values

df.dropna(axis='index', how='any') #the default arguments of dropna

df.replace('NA', np.nan, inplace=True) # replaces all the 'NA' strings with NaN values so that missing values are easier to deal with (needs: import numpy as np)

  • The above drops the rows that have null values; if we replace 'index' with 'columns' it will drop the columns with null values.

  • If we change 'any' to 'all', it will drop only the rows/columns where every value is null.

  • If we want to drop rows with nulls only in particular columns:

df.dropna(axis='index', how='any', subset=['email']) #drops the rows that have null values in the email column

df.dropna(axis='index', how='any', subset=['last', 'email']) #drops the rows that have null values in the last name or email column

df.isna() # prints True or False, if the data has Na values it prints True at that position and False if not.

df.fillna(0) # Replaces all the NA values with zero.(useful when working with Numerical data)

df['age'] = df['age'].astype(int) #converts the age column to integers.

  • Columns read from a CSV that contain text are stored as objects (strings) by default

  • astype(int) doesn't work on columns with null values; first convert the nulls to some number, or cast the column to the 'float' data type instead (NaN is a float).

df['age'] = df['age'].astype(float) #converts the age column to float





made with 💓 by K.Lakshmi
