This tutorial covers the basic usage of the following topics:
This tutorial covers only the fundamentals of these topics. You can learn more about each topic from the links provided below.
Running the following code blocks is highly recommended.
Lists store data of possibly different types in sequence. Every element has a position, called its index. Positive indexing starts at 0 and runs to the last element. Negative indexing starts at -1 and lets you access elements from last to first. Let us now understand lists better with the help of an example program.
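As a small sketch of positive and negative indexing (the list `letters` is a made-up example, not part of the tutorial's data):

```python
# Hypothetical sample list used only for illustration
letters = ['a', 'b', 'c', 'd']

first = letters[0]         # positive index 0 is the first element
last = letters[-1]         # negative index -1 is the last element
second_last = letters[-2]  # -2 is the element before the last

print(first, last, second_last)
```

Negative indices are convenient when you need the end of a list without knowing its length.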
Following are the operations that we can perform on a list:
To create a list, you use the square brackets and add elements into it accordingly. If you do not pass any elements inside the square brackets, you get an empty list as the output.
my_list = [] #create empty list
print(my_list)
my_list = [1, 2, 3, 'example', 3.132] #creating list with data
print(my_list)
[]
[1, 2, 3, 'example', 3.132]
Elements can be added to a list with the append(), extend() and insert() methods.
my_list = [1, 2, 3]
print(my_list)
my_list.append([555, 12]) #add as a single element
print(my_list)
my_list.extend([234, 'more_example']) #add as different elements
print(my_list)
my_list.insert(1, 'insert_example') #add element at index 1
print(my_list)
[1, 2, 3]
[1, 2, 3, [555, 12]]
[1, 2, 3, [555, 12], 234, 'more_example']
[1, 'insert_example', 2, 3, [555, 12], 234, 'more_example']
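The difference between append() and extend() is easy to miss, so here is a minimal sketch that compares list lengths after each call (the variable names `nested` and `flat` are illustrative only):

```python
nested = [1, 2, 3]
nested.append([4, 5])   # the whole list [4, 5] becomes one new element

flat = [1, 2, 3]
flat.extend([4, 5])     # each item of [4, 5] becomes its own element

print(len(nested), nested)
print(len(flat), flat)
```

append() always grows the list by exactly one element, while extend() grows it by the number of items in the argument.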
my_list = [1, 2, 3, 'example', 3.132, 10, 30]
del my_list[5] #delete element at index 5
print(my_list)
my_list.remove('example') #remove element with value
print(my_list)
a = my_list.pop(1) #pop element from list
print('Popped Element: ', a, ' List remaining: ', my_list)
my_list.clear() #empty the list
print(my_list)
[1, 2, 3, 'example', 3.132, 30]
[1, 2, 3, 3.132, 30]
Popped Element:  2  List remaining:  [1, 3, 3.132, 30]
[]
Accessing elements works the same way as accessing characters of strings in Python. You pass index values and obtain the elements you need.
my_list = [1, 2, 3, 'example', 3.132, 10, 30]
for element in my_list: #access elements one by one
    print(element)
print(my_list) #access all elements
print(my_list[3]) #access index 3 element
print(my_list[0:2]) #access elements from 0 to 1 and exclude 2
print(my_list[::-1]) #access elements in reverse
1
2
3
example
3.132
10
30
[1, 2, 3, 'example', 3.132, 10, 30]
example
[1, 2]
[30, 10, 3.132, 'example', 3, 2, 1]
You have several other functions that can be used when working with lists.
my_list = [1, 2, 3, 10, 30, 10]
print(len(my_list)) #find length of list
print(my_list.index(10)) #find index of element that occurs first
print(my_list.count(10)) #find count of the element
print(sorted(my_list)) #print sorted list but not change original
my_list.sort(reverse=True) #sort original list
print(my_list)
6
3
2
[1, 2, 3, 10, 10, 30]
[30, 10, 10, 3, 2, 1]
Dictionaries store key-value pairs. To understand this better, think of a phone directory in which hundreds or thousands of names and their corresponding numbers have been recorded. The fixed labels here, Name and Phone Number, are the keys. The individual names and phone numbers are the values stored under those keys. If you look up a key, you get all the names or all the phone numbers. That is what a key-value pair is, and in Python this structure is stored in a dictionary. Let us understand this better with an example program.
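The phone directory analogy might be sketched as follows. The names and numbers are invented for illustration, and storing each column as a list is just one possible layout:

```python
# Hypothetical directory: names and numbers are made up for illustration
directory = {
    'Name': ['Asha', 'Ben', 'Chen'],
    'Phone Number': ['555-0101', '555-0102', '555-0103'],
}

all_names = directory['Name']            # value stored under the 'Name' key
all_numbers = directory['Phone Number']  # value stored under the other key

# pair each name with its number by position
pairs = list(zip(all_names, all_numbers))
print(pairs[0])
```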
The following are basic operations on dictionaries:
Dictionaries can be created using curly braces or the dict() function. You need to add key-value pairs whenever you work with dictionaries.
my_dict = {} #empty dictionary
print(my_dict)
my_dict = {1: 'Python', 2: 'Java'} #dictionary with elements
print(my_dict)
{}
{1: 'Python', 2: 'Java'}
To change a value in the dictionary, use its key: access the key and assign the new value. To add a value, simply assign to a new key, as shown below.
my_dict = {'First': 'Python', 'Second': 'Java'}
print(my_dict)
my_dict['Second'] = 'C++' #changing element
print(my_dict)
my_dict['Third'] = 'Ruby' #adding key-value pair
print(my_dict)
{'First': 'Python', 'Second': 'Java'}
{'First': 'Python', 'Second': 'C++'}
{'First': 'Python', 'Second': 'C++', 'Third': 'Ruby'}
my_dict = {'First': 'Python', 'Second': 'Java', 'Third': 'Ruby'}
a = my_dict.pop('Third') #pop element
print('Value:', a)
print('Dictionary:', my_dict)
b = my_dict.popitem() #pop the key-value pair
print('Key, value pair:', b)
print('Dictionary', my_dict)
my_dict.clear() #empty dictionary
print('n', my_dict)
Value: Ruby
Dictionary: {'First': 'Python', 'Second': 'Java'}
Key, value pair: ('Second', 'Java')
Dictionary {'First': 'Python'}
n {}
You can access elements only through their keys. Either use the get() method or pass the key in square brackets to retrieve the value.
my_dict = {'First': 'Python', 'Second': 'Java'}
print(my_dict['First']) #access elements using keys
print(my_dict.get('Second'))
Python Java
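The two access styles differ when a key is missing. A minimal sketch, where the key 'Third' and the fallback string 'unknown' are arbitrary choices for the example:

```python
my_dict = {'First': 'Python', 'Second': 'Java'}

missing = my_dict.get('Third')               # get() returns None for a missing key
fallback = my_dict.get('Third', 'unknown')   # or a default value you supply

try:
    my_dict['Third']                         # square brackets raise KeyError instead
    raised = False
except KeyError:
    raised = True

print(missing, fallback, raised)
```

get() is the safer choice when a key may be absent; square brackets are useful when a missing key should be treated as an error.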
The keys(), values() and items() methods return the keys, the values, and the key-value pairs of a dictionary, respectively.
my_dict = {'First': 'Python', 'Second': 'Java', 'Third': 'Ruby'}
print(my_dict.keys()) #get keys
print(my_dict.values()) #get values
print(my_dict.items()) #get key-value pairs
print(my_dict.get('First'))
dict_keys(['First', 'Second', 'Third'])
dict_values(['Python', 'Java', 'Ruby'])
dict_items([('First', 'Python'), ('Second', 'Java'), ('Third', 'Ruby')])
Python
Tuples are the same as lists, except that the data in a tuple cannot be changed once it has been entered. The one exception is when an element of the tuple is itself mutable, such as a list; that element's contents can still be changed in place. The example program will help you understand better.
You create a tuple using parentheses or the tuple() function.
my_tuple = (1, 2, 3) #create tuple
print(my_tuple)
(1, 2, 3)
Accessing elements is the same as it is for accessing values in lists.
my_tuple2 = (1, 2, 3, 'edureka') #access elements
for x in my_tuple2:
    print(x)
print(my_tuple2)
print(my_tuple2[0])
print(my_tuple2[:])
print(my_tuple2[3][4])
1
2
3
edureka
(1, 2, 3, 'edureka')
1
(1, 2, 3, 'edureka')
e
To append values, use the '+' operator with another tuple. This builds a new tuple rather than modifying the original.
my_tuple = (1, 2, 3)
my_tuple = my_tuple + (4, 5, 6) #add elements
print(my_tuple)
(1, 2, 3, 4, 5, 6)
These functions are the same as they are for lists.
my_tuple = (1, 2, 3, ['hindi', 'python'])
my_tuple[3][0] = 'english'
print(my_tuple)
print(my_tuple.count(2))
print(my_tuple.index(['english', 'python']))
(1, 2, 3, ['english', 'python'])
1
3
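To make the immutability rule concrete, the sketch below tries to reassign a tuple slot and then mutates a list stored inside the tuple. The values are arbitrary:

```python
my_tuple = (1, 2, ['a', 'b'])

try:
    my_tuple[0] = 99          # tuples do not support item assignment
    reassigned = True
except TypeError:
    reassigned = False

my_tuple[2].append('c')       # the inner list is mutable, so this works

print(reassigned, my_tuple)
```

The tuple still holds the same three objects throughout; only the contents of the inner list change.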
Sets are unordered collections of unique elements. This means that even if a value is repeated more than once, it is stored in the set only once. Sets resemble the sets you have learnt about in mathematics, and they support the same operations. An example program will help you understand better.
Sets are created using curly braces, but instead of adding key-value pairs, you just pass values.
my_set = {1, 2, 3, 4, 5, 5, 5} #create set
print(my_set)
{1, 2, 3, 4, 5}
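A common use of this uniqueness property is removing duplicates from a list. A small sketch with made-up values:

```python
values = [3, 1, 3, 2, 1]

unique_values = set(values)            # duplicates collapse to one entry each
has_two = 2 in unique_values           # membership test
sorted_unique = sorted(unique_values)  # sets are unordered, so sort for a stable order

print(len(unique_values), has_two, sorted_unique)
```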
To add elements, you use the add() function and pass the value to it.
my_set = {1, 2, 3}
my_set.add(4) #add element to set
print(my_set)
{1, 2, 3, 4}
The different operations on set such as union, intersection and so on are shown below.
my_set = {1, 2, 3, 4}
my_set_2 = {3, 4, 5, 6}
print(my_set.union(my_set_2), '----------', my_set | my_set_2)
print(my_set.intersection(my_set_2), '----------', my_set & my_set_2)
print(my_set.difference(my_set_2), '----------', my_set - my_set_2)
print(my_set.symmetric_difference(my_set_2), '----------', my_set ^ my_set_2)
my_set.clear()
print(my_set)
{1, 2, 3, 4, 5, 6} ---------- {1, 2, 3, 4, 5, 6}
{3, 4} ---------- {3, 4}
{1, 2} ---------- {1, 2}
{1, 2, 5, 6} ---------- {1, 2, 5, 6}
set()
Some resources:
NumPy stands for Numerical Python. It is a Python library for working with arrays. In plain Python, lists serve the purpose of arrays, but they are slow to process. The NumPy array is a powerful N-dimensional array object, and the library also offers linear algebra, Fourier transform, and random number capabilities. For numerical work, its array object is much faster than traditional Python lists.
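One reason NumPy code tends to be faster is that arithmetic applies to a whole array at once instead of looping in Python. The sketch below only illustrates the style and does not measure speed; the values are arbitrary:

```python
import numpy as np

values = [1, 2, 3, 4]

# plain Python: an explicit loop over the list
doubled_list = [v * 2 for v in values]

# NumPy: the multiplication is applied to every element at once
doubled_array = np.array(values) * 2

print(doubled_list)
print(doubled_array)
```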
A one-dimensional array is a type of linear array.
import numpy as np
# creating list
# named sample_list to avoid shadowing the built-in list type
sample_list = [1, 2, 3, 4]
# creating numpy array
sample_array = np.array(sample_list)
print("List in python : ", sample_list)
print("Numpy Array in python :",
      sample_array)
List in python :  [1, 2, 3, 4]
Numpy Array in python : [1 2 3 4]
Data in multidimensional arrays are stored in tabular form.
# importing numpy module
import numpy as np
# creating list
list_1 = [1, 2, 3, 4]
list_2 = [5, 6, 7, 8]
list_3 = [9, 10, 11, 12]
# creating numpy array
sample_array = np.array([list_1,
list_2,
list_3])
print("Numpy multi dimensional array in python\n",
sample_array)
Numpy multi dimensional array in python
 [[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]
# importing numpy module
import numpy as np
# creating list
list_1 = [1, 2, 3, 4]
list_2 = [5, 6, 7, 8]
list_3 = [9, 10, 11, 12]
# creating numpy array
sample_array = np.array([list_1,
list_2,
list_3])
print("Numpy array :")
print(sample_array)
# print shape of the array
print("Shape of the array :",
sample_array.shape)
sample_array2 = np.array([[0, 4, 2],
[3, 4, 5],
[23, 4, 5],
[2, 34, 5],
[5, 6, 7]])
print(sample_array2)
print("shape of the array :",
sample_array2.shape)
Numpy array :
[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]
Shape of the array : (3, 4)
[[ 0  4  2]
 [ 3  4  5]
 [23  4  5]
 [ 2 34  5]
 [ 5  6  7]]
shape of the array : (5, 3)
A data type object (dtype) is an instance of the numpy.dtype class. It describes how the bytes in the fixed-size block of memory corresponding to an array item should be interpreted.
# Import module
import numpy as np
# Creating the array
sample_array_1 = np.array([[0, 4, 2]])
sample_array_2 = np.array([0.2, 0.4, 2.4])
# display data type
print("Data type of the array 1 :",
sample_array_1.dtype)
print("Data type of array 2 :",
sample_array_2.dtype)
Data type of the array 1 : int64 Data type of array 2 : float64
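The dtype can also be chosen explicitly rather than inferred. A minimal sketch, assuming you want 16-bit integers and then a float copy (the array names are illustrative):

```python
import numpy as np

# request a specific dtype instead of letting NumPy infer one
small_ints = np.array([0, 4, 2], dtype=np.int16)

# astype() returns a converted copy; the original array is unchanged
as_float = small_ints.astype(np.float64)

# itemsize is the number of bytes used per element
print(small_ints.dtype, small_ints.itemsize)
print(as_float.dtype, as_float.itemsize)
```

Choosing a smaller dtype reduces memory use, at the cost of a narrower range of representable values.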
# import module
import numpy as np
# creating an array
arr = np.array([3,4,5,5])
print("Array :",arr)
Array : [3 4 5 5]
#Import numpy module
import numpy as np
# iterable
iterable = (a*a for a in range(8))
arr = np.fromiter(iterable, float)
print("fromiter() array :", arr)
fromiter() array : [ 0. 1. 4. 9. 16. 25. 36. 49.]
import numpy as np
var = "Geekforgeeks"
arr = np.fromiter(var, dtype = 'U2')
print("fromiter() array :", arr)
fromiter() array : ['G' 'e' 'e' 'k' 'f' 'o' 'r' 'g' 'e' 'e' 'k' 's']
import numpy as np
print(np.arange(1, 20 , 2, dtype = np.float32))
[ 1. 3. 5. 7. 9. 11. 13. 15. 17. 19.]
import numpy as np
print(np.linspace(3.5, 10, 3))
[ 3.5 6.75 10. ]
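Note that the third argument of np.linspace is the number of samples, not a step size as in np.arange. As a sketch, passing retstep=True also returns the spacing that NumPy computed:

```python
import numpy as np

# 3 evenly spaced samples from 3.5 to 10, endpoint included
points, step = np.linspace(3.5, 10, 3, retstep=True)

print(points)
print(step)
```

With three samples over an interval of width 6.5 there are two gaps, so the spacing is 6.5 / 2.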
import numpy as np
# np.empty allocates memory without initializing it, so the printed values are arbitrary
print(np.empty([4, 3], dtype = np.int32, order = 'f'))
[[4 4 4]
 [0 0 0]
 [4 4 4]
 [0 0 0]]
import numpy as np
print(np.ones([4, 3], dtype = np.int32, order = 'f'))
[[1 1 1]
 [1 1 1]
 [1 1 1]
 [1 1 1]]
import numpy as np
print(np.zeros([4, 3], dtype = np.int32, order = 'f'))
[[0 0 0]
 [0 0 0]
 [0 0 0]
 [0 0 0]]
Some resources:
Pandas is a Python library that provides high-performance, easy-to-use data structures such as Series and DataFrame, along with data analysis tools for the Python programming language. (The older Panel structure has been removed from recent pandas versions.) A pandas DataFrame consists of three main components: the data, the rows, and the columns. To use the pandas library and its data structures, all you have to do is install it and import it. See the pandas documentation for a better understanding and installation guidance.
A pandas DataFrame can be created by loading data from existing external storage such as a SQL database or CSV files. A DataFrame can also be created from lists, dictionaries, and so on. One way to create a DataFrame is shown below:
# import the pandas library
import pandas as pd
# Dictionary of key pair values called data
data = {'Name':['Ashika', 'Tanu', 'Ashwin', 'Mohit', 'Sourabh'],
'Age': [24, 23, 22, 19, 10]}
data
{'Name': ['Ashika', 'Tanu', 'Ashwin', 'Mohit', 'Sourabh'], 'Age': [24, 23, 22, 19, 10]}
# Calling the pandas data frame method by passing the dictionary (data) as a parameter
df = pd.DataFrame(data)
df
| | Name | Age |
---|---|---|
0 | Ashika | 24 |
1 | Tanu | 23 |
2 | Ashwin | 22 |
3 | Mohit | 19 |
4 | Sourabh | 10 |
A DataFrame is a two-dimensional data structure in which data is stored in rows and columns. Some operations on rows and columns are shown below.
Selecting a Column
To select a particular column, pass its name in square brackets to the data frame. Passing a list of names, as in df[['Name']], returns a DataFrame rather than a Series.
# import the pandas library
import pandas as pd
# Dictionary of key pair values called data
data = {'Name':['Ashika', 'Tanu', 'Ashwin', 'Mohit', 'Sourabh'],
'Age': [24, 23, 22, 19, 10]}
# Calling the pandas data frame method by passing the dictionary (data) as a parameter
df = pd.DataFrame(data)
# Selecting column
df[['Name']]
| | Name |
---|---|
0 | Ashika |
1 | Tanu |
2 | Ashwin |
3 | Mohit |
4 | Sourabh |
Selecting a Row
A pandas DataFrame provides the "loc" indexer, which retrieves rows by label. Rows can also be selected by integer position using the "iloc" indexer.
# import the pandas library
import pandas as pd
# Dictionary of key pair values called data
data = {'Name':['Ashika', 'Tanu', 'Ashwin', 'Mohit', 'Sourabh'],
'Age': [24, 23, 22, 19, 10]}
# Calling the pandas data frame method by passing the dictionary (data) as a parameter
df = pd.DataFrame(data)
row = df.loc[1]
row
Name    Tanu
Age       23
Name: 1, dtype: object
row = df.iloc[1, :]
row
Name    Tanu
Age       23
Name: 1, dtype: object
age = df.iloc[1, 1]
age
23
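With the default integer index, loc and iloc look interchangeable. The difference shows up once the index holds labels. A sketch, assuming a made-up string index 'a', 'b', 'c':

```python
import pandas as pd

# the string index is an assumption for this example
df = pd.DataFrame(
    {'Name': ['Ashika', 'Tanu', 'Ashwin'], 'Age': [24, 23, 22]},
    index=['a', 'b', 'c'],
)

by_label = df.loc['b']      # loc looks up the row labelled 'b'
by_position = df.iloc[1]    # iloc looks up the second row by position

print(by_label['Name'], by_position['Name'])
```

Here both expressions return the same row, but df.loc[1] would raise a KeyError because no row carries the label 1.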
You can treat a DataFrame semantically like a dict of like-indexed Series objects. Getting, setting, and deleting columns works with the same syntax as the analogous dict operations:
# import the pandas library
import pandas as pd
# Dictionary of key pair values called data
data = {'Name':['Ashika', 'Tanu', 'Ashwin', 'Mohit', 'Sourabh'],
'Age': [24, 23, 22, 19, 10]}
data
{'Name': ['Ashika', 'Tanu', 'Ashwin', 'Mohit', 'Sourabh'], 'Age': [24, 23, 22, 19, 10]}
# Calling the pandas data frame method by passing the dictionary (data) as a parameter
df = pd.DataFrame(data)
# Selecting the data from the column
df['Age']
0    24
1    23
2    22
3    19
4    10
Name: Age, dtype: int64
# Select rows which have ages > 19
df[df['Age'] > 19]
| | Name | Age |
---|---|---|
0 | Ashika | 24 |
1 | Tanu | 23 |
2 | Ashwin | 22 |
del df['Age']
df
| | Name |
---|---|
0 | Ashika |
1 | Tanu |
2 | Ashwin |
3 | Mohit |
4 | Sourabh |
df.insert(1, 'name', df['Name'])
df
| | Name | name |
---|---|---|
0 | Ashika | Ashika |
1 | Tanu | Tanu |
2 | Ashwin | Ashwin |
3 | Mohit | Mohit |
4 | Sourabh | Sourabh |
Missing data occurs frequently in large data sets, often appearing as NaN (Not a Number). To find those values we can use the "isnull()" method, which checks each entry of a data frame and reports whether it is null.
Checking for the missing values.
# importing both pandas and numpy libraries
import pandas as pd
import numpy as np
# Dictionary of key pair values called data
data ={'Name':['Tanu', np.nan],
'Age': [23, np.nan]}
data
{'Name': ['Tanu', nan], 'Age': [23, nan]}
df = pd.DataFrame(data)
df
| | Name | Age |
---|---|---|
0 | Tanu | 23.0 |
1 | NaN | NaN |
# using the isnull() function
df.isnull()
| | Name | Age |
---|---|---|
0 | False | False |
1 | True | True |
Now that we have found the missing values, the next task is to fill them with 0. This can be done as shown below:
df.fillna(0)
| | Name | Age |
---|---|---|
0 | Tanu | 23.0 |
1 | 0 | 0.0 |
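Two variations are often useful in practice: counting the missing values per column, and filling each column with a value that suits its type. A sketch, where the fill value 'unknown' is an arbitrary choice for the example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Tanu', np.nan], 'Age': [23, np.nan]})

# isnull() gives True/False per cell; sum() counts the True values per column
missing_per_column = df.isnull().sum()

# fillna() accepts a dict mapping column names to fill values
filled = df.fillna({'Name': 'unknown', 'Age': 0})

print(missing_per_column)
print(filled)
```

fillna() returns a new data frame here, so `df` itself still contains the missing values.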
To give the columns or the index values of your data frame different names, it's best to use the .rename() method. The column names below are deliberately misspelled to demonstrate this.
# import the pandas library
import pandas as pd
# Dictionary of key pair values called data
data = {'NAMe':['Ashika', 'Tanu', 'Ashwin', 'Mohit', 'Sourabh'],
'AGe': [24, 23, 22, 19, 10]}
data
{'NAMe': ['Ashika', 'Tanu', 'Ashwin', 'Mohit', 'Sourabh'], 'AGe': [24, 23, 22, 19, 10]}
# Calling the pandas data frame method by passing the dictionary (data) as a parameter
df = pd.DataFrame(data)
df
| | NAMe | AGe |
---|---|---|
0 | Ashika | 24 |
1 | Tanu | 23 |
2 | Ashwin | 22 |
3 | Mohit | 19 |
4 | Sourabh | 10 |
newcols = {
'NAMe': 'Name',
'AGe': 'Age'
}
# Use `rename()` to rename your columns
df.rename(columns=newcols, inplace=True)
df
| | Name | Age |
---|---|---|
0 | Ashika | 24 |
1 | Tanu | 23 |
2 | Ashwin | 22 |
3 | Mohit | 19 |
4 | Sourabh | 10 |
# The values of new index
newindex = {
0: 'a',
1: 'b',
2: 'c',
3: 'd',
4: 'e'
}
# Rename your index
df.rename(index=newindex)
| | Name | Age |
---|---|---|
a | Ashika | 24 |
b | Tanu | 23 |
c | Ashwin | 22 |
d | Mohit | 19 |
e | Sourabh | 10 |
Write to csv
# import the pandas library
import pandas as pd
# Dictionary of key pair values called data
data = {'Name':['Ashika', 'Tanu', 'Ashwin', 'Mohit', 'Sourabh'],
'Age': [24, 23, 22, 19, 10]}
df = pd.DataFrame(data)
print(df)
# Write to csv
df.to_csv('myDataFrame.csv')
# If you don't want to include row index
df.to_csv('myDataFrame_noidx.csv', index=False)
      Name  Age
0   Ashika   24
1     Tanu   23
2   Ashwin   22
3    Mohit   19
4  Sourabh   10
# read from csv
df = pd.read_csv('myDataFrame.csv', header=None, nrows=5)
print(df)
     0       1    2
0  NaN    Name  Age
1  0.0  Ashika   24
2  1.0    Tanu   23
3  2.0  Ashwin   22
4  3.0   Mohit   19
# when header=None, the column names are indices
df = pd.read_csv('myDataFrame_noidx.csv', header=None, nrows=5)
print(df)
        0    1
0    Name  Age
1  Ashika   24
2    Tanu   23
3  Ashwin   22
4   Mohit   19
# If you want the first row to be used as column names, explicitly set header=0
df = pd.read_csv('myDataFrame_noidx.csv', header=0, nrows=5)
print(df)
      Name  Age
0   Ashika   24
1     Tanu   23
2   Ashwin   22
3    Mohit   19
4  Sourabh   10
Sometimes there may be incorrect values in the dataframe:
import pandas as pd
import numpy as np
# There is an "A" in the "Rate" column which should be numeric value
data1 = {'id' : [1,2,3,4,5],
'Rate' : [5,9,3,'A',6],
'Name' : ['a','b','c','d','e']}
df = pd.DataFrame(data1)
print(df)
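   id Rate Name
0   1    5    a
1   2    9    b
2   3    3    c
3   4    A    d
4   5    6    e
In this case, we can convert the column with pd.to_numeric(). With errors='coerce', values that cannot be parsed become NaN. The command below also multiplies the converted rates by 2: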
df['Rate'] = pd.to_numeric(df['Rate'], errors='coerce') * 2
print(df)
   id  Rate Name
0   1  10.0    a
1   2  18.0    b
2   3   6.0    c
3   4   NaN    d
4   5  12.0    e
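Before overwriting the column, it can help to see which rows hold the invalid entries. One possible approach, sketched on a reduced copy of the data above:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'Rate': [5, 9, 3, 'A', 6]})

# convert without modifying df; unparseable entries become NaN
numeric_rate = pd.to_numeric(df['Rate'], errors='coerce')

# rows whose Rate could not be converted
bad_rows = df[numeric_rate.isna()]

print(bad_rows)
```

Inspecting `bad_rows` first lets you decide whether to drop, correct, or fill those entries.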
Some resources:
If you want to implement your learning algorithm with scikit-learn, the first thing you need to do is prepare your data.
This exposes the structure of the problem to the learning algorithm you decide to use.
There are four proven steps in preparing data for learning with scikit-learn. They include:
Rescaling the attributes of your data, particularly when they are measured on different scales, so that all attributes fall within the same range. Many learning algorithms benefit from this.
This process is called normalization when the attributes are rescaled to the range 0 to 1. It is useful for optimization algorithms such as gradient descent, which form the core of many learning algorithms.
import pandas
import scipy
import numpy
from sklearn.preprocessing import MinMaxScaler
# data values
X = [ [110, 200], [120, 800], [310, 400], [140, 900], [510, 200], [653, 400] ,[310, 880] ]
# transform data
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)
# summarize transformed data
numpy.set_printoptions(precision=2)
print(rescaledX[0:6,:])
[[0.   0.  ]
 [0.02 0.86]
 [0.37 0.29]
 [0.06 1.  ]
 [0.74 0.  ]
 [1.   0.29]]
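For the default feature range of 0 to 1, min-max rescaling corresponds to (x - min) / (max - min), computed separately for each column. The sketch below reproduces that formula by hand with NumPy on the same data; it is meant as an illustration of the arithmetic, not as a replacement for MinMaxScaler:

```python
import numpy as np

X = np.array([[110, 200], [120, 800], [310, 400], [140, 900],
              [510, 200], [653, 400], [310, 880]], dtype=float)

col_min = X.min(axis=0)   # minimum of each column
col_max = X.max(axis=0)   # maximum of each column

# rescale every column to the range 0 to 1
rescaled = (X - col_min) / (col_max - col_min)

np.set_printoptions(precision=2)
print(rescaled[0:6, :])
```

Each column's smallest value maps to 0 and its largest to 1, which is why the first column reaches 1 at 653 and the second at 900.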