A Tutorial about Python Basics¶

This tutorial covers the basic usage of the following topics:

  • Basic data structures
  • Numpy
  • Pandas
  • Basic data processing for sklearn

This tutorial can only cover the very basics of these topics. You can learn more about each topic from the links provided in the sections below.

Running the following code blocks is highly recommended.


Basic data structures (resource)¶

  • Lists
  • Dictionaries
  • Tuples
  • Sets

1) Lists¶

Lists store data, possibly of different data types, in a sequential manner. Every element of the list has a position, which is called its index. Positive indexes start at 0 and run up to the last element. Negative indexes start at -1 at the last element and count backwards, so you can access elements from last to first. The example programs below show how lists work.

Following are the operations that we can perform on a list:

  1. Creating a list
  2. Adding elements
  3. Deleting elements
  4. Accessing elements
  5. Some other functions

Creating a list¶

To create a list, write the elements inside square brackets, separated by commas. If you do not put any elements inside the square brackets, you get an empty list.

In [116]:
my_list = [] #create empty list
print(my_list)
my_list = [1, 2, 3, 'example', 3.132] #creating list with data
print(my_list)
[]
[1, 2, 3, 'example', 3.132]

Adding elements¶

You can add elements to a list using the append(), extend() and insert() methods.

  • The append() method adds its argument to the end of the list as a single element, even if the argument is itself a list.
  • The extend() method adds the elements of the iterable passed to it one by one to the end of the list.
  • The insert() method places the element at the given index and shifts the later elements to the right, so the list grows by one.
In [117]:
my_list = [1, 2, 3]
print(my_list)
my_list.append([555, 12]) #add as a single element
print(my_list)
my_list.extend([234, 'more_example']) #add as different elements
print(my_list)
my_list.insert(1, 'insert_example') #insert element at index 1
print(my_list)
[1, 2, 3]
[1, 2, 3, [555, 12]]
[1, 2, 3, [555, 12], 234, 'more_example']
[1, 'insert_example', 2, 3, [555, 12], 234, 'more_example']

Deleting Elements¶

  • To delete an element by index, use the del keyword, which is built into Python. It does not return the deleted element.
  • If you want the deleted element back, use the pop() method, which takes an index and returns the element stored there.
  • To remove an element by its value, use the remove() method, which deletes the first matching element.
In [118]:
my_list = [1, 2, 3, 'example', 3.132, 10, 30]
del my_list[5] #delete element at index 5
print(my_list)
my_list.remove('example') #remove element with value
print(my_list)
a = my_list.pop(1) #pop element from list
print('Popped Element: ', a, ' List remaining: ', my_list)
my_list.clear() #empty the list
print(my_list)
[1, 2, 3, 'example', 3.132, 30]
[1, 2, 3, 3.132, 30]
Popped Element:  2  List remaining:  [1, 3, 3.132, 30]
[]

Accessing Elements¶

Accessing list elements works the same way as accessing characters in a Python string: you pass an index or a slice in square brackets and get the matching values back.

In [119]:
my_list = [1, 2, 3, 'example', 3.132, 10, 30]
for element in my_list: #access elements one by one
    print(element)
print(my_list) #access all elements
print(my_list[3]) #access index 3 element
print(my_list[0:2]) #access elements from 0 to 1 and exclude 2
print(my_list[::-1]) #access elements in reverse
1
2
3
example
3.132
10
30
[1, 2, 3, 'example', 3.132, 10, 30]
example
[1, 2]
[30, 10, 3.132, 'example', 3, 2, 1]
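
Negative indexing was mentioned in the introduction to lists but is not used in the cell above. The following is a minimal sketch of it; the list `letters` is a made-up example, not part of the original tutorial.

```python
# Hypothetical list used only for this illustration
letters = ['a', 'b', 'c', 'd', 'e']

last = letters[-1]            # -1 is the last element
second_last = letters[-2]     # -2 is the element before the last
tail = letters[-3:]           # the last three elements, in original order
without_last = letters[:-1]   # everything except the last element

print(last, second_last)
print(tail)
print(without_last)
```

Negative indexes can be mixed with slices, which is often handy when you only care about the end of a list.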

Other Functions¶

Several other functions and methods are useful when working with lists.

  • The len() function returns the length of the list.
  • The index() method returns the index of the first occurrence of the value passed to it.
  • The count() method returns how many times the value passed to it occurs in the list.
  • The sorted() function and the sort() method both sort the values of the list. sorted() returns a new sorted list and leaves the original unchanged, whereas sort() modifies the original list in place.
In [120]:
my_list = [1, 2, 3, 10, 30, 10]
print(len(my_list)) #find length of list
print(my_list.index(10)) #find index of element that occurs first
print(my_list.count(10)) #find count of the element
print(sorted(my_list)) #print sorted list but not change original
my_list.sort(reverse=True) #sort original list
print(my_list)
6
3
2
[1, 2, 3, 10, 10, 30]
[30, 10, 10, 3, 2, 1]
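
A common mistake is to assign the result of sort() to a variable. The short sketch below, using a made-up list `numbers`, shows the difference between the two ways of sorting.

```python
# Hypothetical sample data for this illustration
numbers = [3, 1, 2]

new_list = sorted(numbers)   # returns a new sorted list
print(new_list, numbers)     # the original list is unchanged at this point

result = numbers.sort()      # sorts in place and returns None
print(result, numbers)
```

If you need to keep the original order, use sorted(); if you do not, sort() avoids creating a second list.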

2) Dictionaries¶

Dictionaries are used to store key-value pairs. To understand this better, think of a phone directory that holds thousands of names and their corresponding numbers. Each name is a key, and the phone number stored for that name is its value. If you look up a name (the key), you get back its phone number (the value). That is what a key-value pair is, and in Python this structure is stored in a dictionary. The example programs below make this concrete.

The following are the basic operations on dictionaries:

  1. Creating a dictionary
  2. Changing and Adding key, value pairs
  3. Deleting key, value pairs
  4. Accessing Elements
  5. Some other functions

Creating a Dictionary¶

Dictionaries can be created using curly braces or the dict() function. Each entry is written as a key-value pair, with a colon between the key and its value.

In [121]:
my_dict = {} #empty dictionary
print(my_dict)
my_dict = {1: 'Python', 2: 'Java'} #dictionary with elements
print(my_dict)
{}
{1: 'Python', 2: 'Java'}

Changing and Adding key, value pairs¶

To change a value in the dictionary, you go through its key: access the key and assign the new value to it. To add an entry, assign a value to a key that is not yet in the dictionary, as shown below.

In [122]:
my_dict = {'First': 'Python', 'Second': 'Java'}
print(my_dict)
my_dict['Second'] = 'C++' #changing element
print(my_dict)
my_dict['Third'] = 'Ruby' #adding key-value pair
print(my_dict)
{'First': 'Python', 'Second': 'Java'}
{'First': 'Python', 'Second': 'C++'}
{'First': 'Python', 'Second': 'C++', 'Third': 'Ruby'}

Deleting key, value pairs¶

  • To delete an entry by its key, use the pop() method, which returns the value that has been deleted.
  • To remove the most recently inserted entry, use the popitem() method, which returns a tuple of the key and value.
  • To clear the entire dictionary, use the clear() method.
In [123]:
my_dict = {'First': 'Python', 'Second': 'Java', 'Third': 'Ruby'}
a = my_dict.pop('Third') #pop element
print('Value:', a)
print('Dictionary:', my_dict)
b = my_dict.popitem() #pop the key-value pair
print('Key, value pair:', b)
print('Dictionary', my_dict)
my_dict.clear() #empty dictionary
print('n', my_dict)
Value: Ruby
Dictionary: {'First': 'Python', 'Second': 'Java'}
Key, value pair: ('Second', 'Java')
Dictionary {'First': 'Python'}
n {}

Accessing Elements¶

You can access values only through their keys. You can either pass the key in square brackets or use the get() method. If the key is missing, square brackets raise a KeyError, whereas get() returns None.

In [124]:
my_dict = {'First': 'Python', 'Second': 'Java'}
print(my_dict['First']) #access elements using keys
print(my_dict.get('Second'))
Python
Java
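
The two ways of accessing a value behave differently when the key does not exist. The sketch below illustrates this with the same dictionary as above; the key 'Third' and the default 'unknown' are chosen only for the illustration.

```python
my_dict = {'First': 'Python', 'Second': 'Java'}

missing = my_dict.get('Third')               # no such key: get() returns None
fallback = my_dict.get('Third', 'unknown')   # or the default you supply
print(missing, fallback)

try:
    my_dict['Third']                         # square brackets raise KeyError
except KeyError as err:
    caught = repr(err)
    print('KeyError for key:', err)
```

Using get() with a default is convenient when a missing key is an expected situation rather than an error.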

Other Functions¶

The keys(), values() and items() methods return the keys, the values, and the key-value pairs of the dictionary, respectively.

In [125]:
my_dict = {'First': 'Python', 'Second': 'Java', 'Third': 'Ruby'}
print(my_dict.keys()) #get keys
print(my_dict.values()) #get values
print(my_dict.items()) #get key-value pairs
print(my_dict.get('First'))
dict_keys(['First', 'Second', 'Third'])
dict_values(['Python', 'Java', 'Ruby'])
dict_items([('First', 'Python'), ('Second', 'Java'), ('Third', 'Ruby')])
Python
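
In practice, items() is most often used in a for loop. The following is a small sketch of that pattern, together with a membership test on the keys.

```python
my_dict = {'First': 'Python', 'Second': 'Java', 'Third': 'Ruby'}

# Loop over key-value pairs, unpacking each tuple into two names
for key, value in my_dict.items():
    print(key, '->', value)

has_key = 'Second' in my_dict          # the in operator checks the keys
keys_as_list = list(my_dict.keys())    # a view can be converted to a list
print(has_key, keys_as_list)
```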

3) Tuples¶

Tuples are like lists, except that a tuple cannot be changed once it has been created: you cannot add, remove or replace its elements. The one caveat is that if an element inside the tuple is itself mutable, such as a list, that element can still be modified in place. The example programs below show this.

Creating a Tuple¶

You create a tuple using parentheses or the tuple() function.

In [126]:
my_tuple = (1, 2, 3) #create tuple
print(my_tuple) 
(1, 2, 3)

Accessing Elements¶

Accessing elements is the same as it is for accessing values in lists.

In [127]:
my_tuple2 = (1, 2, 3, 'edureka') #access elements
for x in my_tuple2:
    print(x)
print(my_tuple2)
print(my_tuple2[0])
print(my_tuple2[:])
print(my_tuple2[3][4])
1
2
3
edureka
(1, 2, 3, 'edureka')
1
(1, 2, 3, 'edureka')
e

Appending Elements¶

Tuples have no append() method. To add values, you use the ‘+’ operator with another tuple, which builds a new tuple containing the elements of both.

In [128]:
my_tuple = (1, 2, 3)
my_tuple = my_tuple + (4, 5, 6) #add elements
print(my_tuple)
(1, 2, 3, 4, 5, 6)

Other Functions¶

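The count() and index() methods work the same way as they do for lists. The example also modifies a list stored inside the tuple, which is allowed because the list itself is mutable.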

In [129]:
my_tuple = (1, 2, 3, ['hindi', 'python'])
my_tuple[3][0] = 'english'
print(my_tuple)
print(my_tuple.count(2))
print(my_tuple.index(['english', 'python']))
(1, 2, 3, ['english', 'python'])
1
3
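
To see what "cannot be changed" means in practice, the sketch below tries to replace an element of a tuple and then modifies a list stored inside it. The tuple contents are made up for the illustration.

```python
my_tuple = (1, 2, ['a', 'b'])

try:
    my_tuple[0] = 100             # replacing an element is not allowed
except TypeError as err:
    error_message = str(err)
    print('TypeError:', error_message)

my_tuple[2].append('c')           # changing the list inside the tuple is allowed
print(my_tuple)
```

The tuple still holds the same three objects afterwards; only the contents of the inner list have changed.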

4) Sets¶

A set is an unordered collection of unique elements. This means that even if a value is repeated, it is stored in the set only once. Sets resemble the sets you have learnt about in mathematics, and they support the same operations, such as union and intersection. The example programs below show this.

Creating a set¶

Sets are created using curly braces, but instead of adding key-value pairs, you just pass values.

In [130]:
my_set = {1, 2, 3, 4, 5, 5, 5} #create set
print(my_set)
{1, 2, 3, 4, 5}
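
One detail that is easy to miss: empty curly braces create a dictionary, not a set. The sketch below shows how to create an empty set with set(), and how set() can remove duplicates from a list.

```python
empty_dict = {}       # curly braces alone create an empty dictionary
empty_set = set()     # use set() to create an empty set
print(type(empty_dict), type(empty_set))

# set() also removes duplicates from any iterable
unique_values = set([1, 2, 2, 3, 3, 3])
print(unique_values)
```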

Adding elements¶

To add elements, you use the add() function and pass the value to it.

In [131]:
my_set = {1, 2, 3}
my_set.add(4) #add element to set
print(my_set)
{1, 2, 3, 4}

Operations in sets¶

The different operations on set such as union, intersection and so on are shown below.

In [132]:
my_set = {1, 2, 3, 4}
my_set_2 = {3, 4, 5, 6}
print(my_set.union(my_set_2), '----------', my_set | my_set_2)
print(my_set.intersection(my_set_2), '----------', my_set & my_set_2)
print(my_set.difference(my_set_2), '----------', my_set - my_set_2)
print(my_set.symmetric_difference(my_set_2), '----------', my_set ^ my_set_2)
my_set.clear()
print(my_set)
{1, 2, 3, 4, 5, 6} ---------- {1, 2, 3, 4, 5, 6}
{3, 4} ---------- {3, 4}
{1, 2} ---------- {1, 2}
{1, 2, 5, 6} ---------- {1, 2, 5, 6}
set()

Basics of Numpy¶

Some resources:

  • https://numpy.org/doc/stable/user/basics.html
  • https://www.geeksforgeeks.org/basics-of-numpy-arrays/

NumPy stands for Numerical Python. It is a Python library used for working with arrays. In plain Python, lists can serve as arrays, but they are slow for numerical processing. NumPy provides a powerful N-dimensional array object, along with linear algebra, Fourier transform, and random number capabilities. Operations on NumPy arrays are typically much faster than the equivalent loops over Python lists.

One Dimensional Array¶

A one-dimensional array is a type of linear array.

In [133]:
import numpy as np
 
# creating list (named my_list so it does not shadow the built-in list type)
my_list = [1, 2, 3, 4]

# creating numpy array
sample_array = np.array(my_list)
print("List in python : ", my_list)
print("Numpy Array in python :",
      sample_array)
List in python :  [1, 2, 3, 4]
Numpy Array in python : [1 2 3 4]
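
Apart from speed, arrays differ from lists in how operators behave. The following sketch, with a made-up list `values`, shows that arithmetic on an array is applied to every element, while the same operator on a list repeats the list.

```python
import numpy as np

values = [1, 2, 3, 4]
arr = np.array(values)

doubled_list = [v * 2 for v in values]   # a list needs an explicit loop
doubled_arr = arr * 2                    # the array multiplies every element
repeated = values * 2                    # on a list, * repeats the list

print(doubled_list)
print(doubled_arr)
print(repeated)
```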

Multi-Dimensional Array¶

Data in a two-dimensional array is stored in tabular form, as rows and columns. Arrays can also have more than two dimensions.

In [134]:
# importing numpy module
import numpy as np
 
# creating list
list_1 = [1, 2, 3, 4]
list_2 = [5, 6, 7, 8]
list_3 = [9, 10, 11, 12]
 
# creating numpy array
sample_array = np.array([list_1,
                         list_2,
                         list_3])
 
print("Numpy multi dimensional array in python\n",
      sample_array)
Numpy multi dimensional array in python
 [[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]

Anatomy of an array¶

  1. Axis: An axis is one dimension of the array. In a two-dimensional array, axis 0 runs down the rows and axis 1 runs across the columns.
  2. Shape: The number of elements along each axis. It is given as a tuple.
In [135]:
# importing numpy module
import numpy as np
 
# creating list
list_1 = [1, 2, 3, 4]
list_2 = [5, 6, 7, 8]
list_3 = [9, 10, 11, 12]
 
# creating numpy array
sample_array = np.array([list_1,
                         list_2,
                         list_3])
 
print("Numpy array :")
print(sample_array)
 
# print shape of the array
print("Shape of the array :",
      sample_array.shape)

sample_array2 = np.array([[0, 4, 2],
                       [3, 4, 5],
                       [23, 4, 5],
                       [2, 34, 5],
                       [5, 6, 7]])
 
print(sample_array2)
print("shape of the array :",
      sample_array2.shape)
Numpy array :
[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]
Shape of the array : (3, 4)
[[ 0  4  2]
 [ 3  4  5]
 [23  4  5]
 [ 2 34  5]
 [ 5  6  7]]
shape of the array : (5, 3)
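
Axes become easier to understand once you use them in an operation. The sketch below sums the 3 x 4 array from above along each axis; summing is just one example of a function that takes an axis argument.

```python
import numpy as np

sample_array = np.array([[1, 2, 3, 4],
                         [5, 6, 7, 8],
                         [9, 10, 11, 12]])

column_sums = sample_array.sum(axis=0)   # collapse axis 0 (rows): one sum per column
row_sums = sample_array.sum(axis=1)      # collapse axis 1 (columns): one sum per row

print("Number of axes :", sample_array.ndim)
print("Sum along axis 0 :", column_sums)
print("Sum along axis 1 :", row_sums)
```

The result along axis 0 has 4 entries and the result along axis 1 has 3, matching the shape (3, 4).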

Data type objects (dtype):¶

A data type object (dtype) is an instance of the numpy.dtype class. It describes how the bytes in the fixed-size block of memory corresponding to an array item should be interpreted.

In [136]:
# Import module
import numpy as np
 
# Creating the array
sample_array_1 = np.array([[0, 4, 2]])
 
sample_array_2 = np.array([0.2, 0.4, 2.4])
 
# display data type
print("Data type of the array 1 :",
      sample_array_1.dtype)
 
print("Data type of array 2 :",
      sample_array_2.dtype)
Data type of the array 1 : int64
Data type of array 2 : float64
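
The dtype does not have to be inferred; you can set it when creating an array or convert an existing array. The following is a small sketch using astype() and the dtype argument, with made-up values. Note that the default integer type shown above can depend on your platform and NumPy version.

```python
import numpy as np

floats = np.array([0.2, 1.7, 2.4])
as_ints = floats.astype(np.int64)                   # conversion drops the fractional part
explicit = np.array([1, 2, 3], dtype=np.float32)    # dtype chosen at creation

print(as_ints, as_ints.dtype)
print(explicit, explicit.dtype)
```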

Some different ways of creating a NumPy array:¶

  1. numpy.array(): The array object in NumPy is called ndarray. We can create an ndarray using the numpy.array() function.
In [137]:
# import module
import numpy as np
 
#creating a array
arr = np.array([3,4,5,5])
print("Array :",arr)
Array : [3 4 5 5]
  1. numpy.fromiter(): The fromiter() function creates a new one-dimensional array from an iterable object.
In [138]:
#Import numpy module
import numpy as np
 
# iterable
iterable = (a*a for a in range(8))
arr = np.fromiter(iterable, float)
print("fromiter() array :", arr)
fromiter() array : [ 0.  1.  4.  9. 16. 25. 36. 49.]
In [139]:
import numpy as np
var = "Geekforgeeks"
arr = np.fromiter(var, dtype = 'U2')
print("fromiter() array :", arr)
fromiter() array : ['G' 'e' 'e' 'k' 'f' 'o' 'r' 'g' 'e' 'e' 'k' 's']
  1. numpy.arange(): This built-in NumPy function returns evenly spaced values within a given interval. The stop value itself is excluded.
In [140]:
import numpy as np
print(np.arange(1, 20 , 2, dtype = np.float32))
[ 1.  3.  5.  7.  9. 11. 13. 15. 17. 19.]
  1. numpy.linspace(): This function returns a given number of evenly spaced values over the interval between two limits. Both limits are included by default.
In [141]:
import numpy as np
print(np.linspace(3.5, 10, 3))
[ 3.5   6.75 10.  ]
  1. numpy.empty(): This function creates a new array of the given shape and type without initializing its values. The contents are arbitrary, so the output below may differ on your machine.
In [142]:
import numpy as np
print(np.empty([4, 3], dtype = np.int32, order = 'f'))
[[4 4 4]
 [0 0 0]
 [4 4 4]
 [0 0 0]]
  1. numpy.ones(): This function returns a new array of the given shape and type, filled with ones (1).
In [143]:
import numpy as np
print(np.ones([4, 3], dtype = np.int32, order = 'f'))
[[1 1 1]
 [1 1 1]
 [1 1 1]
 [1 1 1]]
  1. numpy.zeros(): This function returns a new array of the given shape and type, filled with zeros (0).
In [144]:
import numpy as np
print(np.zeros([4, 3], dtype = np.int32, order = 'f'))
[[0 0 0]
 [0 0 0]
 [0 0 0]
 [0 0 0]]
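
Of the functions above, arange() and linspace() are easy to confuse. The sketch below puts them side by side on the same interval: arange() takes a step size and leaves out the stop value, while linspace() takes the number of points and includes the stop value by default.

```python
import numpy as np

by_step = np.arange(0, 1, 0.25)     # step of 0.25, stop value 1 is excluded
by_count = np.linspace(0, 1, 5)     # 5 points, stop value 1 is included

print(by_step)
print(by_count)
```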

Basics of Pandas¶

Some resources:

  1. https://towardsdatascience.com/python-pandas-data-frame-basics-b5cfbcd8c039
  2. https://pandas.pydata.org/docs/user_guide/10min.html

Pandas is a Python library that provides high-performance, easy-to-use data structures, such as Series and DataFrame, together with data analysis tools for the Python programming language. (The Panel structure mentioned in older tutorials has been removed from current versions of pandas.) A pandas DataFrame consists of three main components: the data, the rows, and the columns. To use the pandas library and its data structures, all you have to do is install it and import it. See the pandas documentation for a better understanding and for installation guidance.

Basic operations that can be applied to a pandas DataFrame are listed below:¶

  1. Creating a Data Frame.
  2. Performing operations on Rows and Columns.
  3. Data Selection, addition, deletion.
  4. Working with missing data.
  5. Renaming the columns or indices of a DataFrame.
  6. Reading from and writing to CSV files.
  7. Handling numeric exceptions.

Creating a pandas DataFrame¶

A pandas DataFrame can be created by loading data from existing external storage, such as a SQL database or a CSV file. It can also be created from Python lists, dictionaries, and so on. One of the ways to create a pandas DataFrame is shown below:

In [145]:
# import the pandas library
import pandas as pd
# Dictionary of key pair values called data
data = {'Name':['Ashika', 'Tanu', 'Ashwin', 'Mohit', 'Sourabh'],
        'Age': [24, 23, 22, 19, 10]}
data
Out[145]:
{'Name': ['Ashika', 'Tanu', 'Ashwin', 'Mohit', 'Sourabh'],
 'Age': [24, 23, 22, 19, 10]}
In [146]:
# Calling the pandas data frame method by passing the dictionary (data) as a parameter
df = pd.DataFrame(data)
df
Out[146]:
Name Age
0 Ashika 24
1 Tanu 23
2 Ashwin 22
3 Mohit 19
4 Sourabh 10

Performing operations on Rows and Columns¶

A DataFrame is a two-dimensional data structure: data is stored in rows and columns. Some operations on rows and columns are shown below.

Selecting a Column

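To select a particular column, pass its name in square brackets after the DataFrame. Passing a list of names, as in df[['Name']], returns a DataFrame, whereas passing a single name, as in df['Name'], returns a Series.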

In [147]:
# import the pandas library
import pandas as pd
# Dictionary of key pair values called data
data = {'Name':['Ashika', 'Tanu', 'Ashwin', 'Mohit', 'Sourabh'],
        'Age': [24, 23, 22, 19, 10]}
# Calling the pandas data frame method by passing the dictionary (data) as a parameter
df = pd.DataFrame(data)
# Selecting column
df[['Name']]
Out[147]:
Name
0 Ashika
1 Tanu
2 Ashwin
3 Mohit
4 Sourabh

Selecting a Row

A pandas DataFrame provides the "loc" indexer, which retrieves rows by their index label. Rows can also be selected by integer position with the "iloc" indexer. Both are used with square brackets rather than called as functions.

In [148]:
# import the pandas library
import pandas as pd
# Dictionary of key pair values called data
data = {'Name':['Ashika', 'Tanu', 'Ashwin', 'Mohit', 'Sourabh'],
        'Age': [24, 23, 22, 19, 10]}
# Calling the pandas data frame method by passing the dictionary (data) as a parameter
df = pd.DataFrame(data)
row = df.loc[1]
row
Out[148]:
Name    Tanu
Age       23
Name: 1, dtype: object
In [149]:
row = df.iloc[1, :]
row
Out[149]:
Name    Tanu
Age       23
Name: 1, dtype: object
In [150]:
age = df.iloc[1, 1]
age
Out[150]:
23
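
With the default index 0, 1, 2, ... the two ways of selecting rows look identical. The difference shows up when the index holds other labels. The sketch below assumes a small DataFrame with a made-up string index to illustrate this.

```python
import pandas as pd

# Hypothetical string index used only for this illustration
df = pd.DataFrame({'Name': ['Ashika', 'Tanu', 'Ashwin'],
                   'Age': [24, 23, 22]},
                  index=['a', 'b', 'c'])

by_label = df.loc['b']              # row whose index label is 'b'
by_position = df.iloc[1]            # second row, whatever its label
age_by_label = df.loc['b', 'Age']   # row label and column name together

print(by_label)
print(by_position)
print(age_by_label)
```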

Data Selection, addition, deletion¶

You can treat a DataFrame semantically like a dict of like-indexed Series objects. Getting, setting, and deleting columns works with the same syntax as the analogous dict operations:

In [151]:
# import the pandas library
import pandas as pd
# Dictionary of key pair values called data
data = {'Name':['Ashika', 'Tanu', 'Ashwin', 'Mohit', 'Sourabh'],
        'Age': [24, 23, 22, 19, 10]}
data
Out[151]:
{'Name': ['Ashika', 'Tanu', 'Ashwin', 'Mohit', 'Sourabh'],
 'Age': [24, 23, 22, 19, 10]}
In [152]:
# Calling the pandas data frame method by passing the dictionary (data) as a parameter
df = pd.DataFrame(data)
# Selecting the data from the column
df['Age']
Out[152]:
0    24
1    23
2    22
3    19
4    10
Name: Age, dtype: int64
In [153]:
# Select rows which have ages > 19
df[df['Age'] > 19]
Out[153]:
Name Age
0 Ashika 24
1 Tanu 23
2 Ashwin 22
In [154]:
del df['Age']
df
Out[154]:
Name
0 Ashika
1 Tanu
2 Ashwin
3 Mohit
4 Sourabh
In [155]:
df.insert(1, 'name', df['Name'])
df
Out[155]:
Name name
0 Ashika Ashika
1 Tanu Tanu
2 Ashwin Ashwin
3 Mohit Mohit
4 Sourabh Sourabh
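
Besides insert(), a column can be added by plain assignment, just like adding a key to a dict. The sketch below adds a computed column and then removes it again with drop(); the column name 'AgeNextYear' is invented for the example.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Ashika', 'Tanu', 'Ashwin', 'Mohit', 'Sourabh'],
                   'Age': [24, 23, 22, 19, 10]})

# Add a new column computed from an existing one
df['AgeNextYear'] = df['Age'] + 1
print(df)

# drop() returns a new DataFrame and leaves df itself unchanged
trimmed = df.drop(columns=['AgeNextYear'])
print(trimmed)
```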

Working with missing data¶

Missing data occurs often when working with large data sets. It usually appears as NaN (Not a Number). To find those values we can use the "isnull()" method, which checks every entry of a DataFrame and returns True where the value is missing.

Checking for the missing values.

In [156]:
# importing both pandas and numpy libraries
import pandas as pd
import numpy as np

# Dictionary of key pair values called data
data ={'Name':['Tanu', np.nan],
       'Age': [23, np.nan]}
data
Out[156]:
{'Name': ['Tanu', nan], 'Age': [23, nan]}
In [157]:
df = pd.DataFrame(data)
df
Out[157]:
Name Age
0 Tanu 23.0
1 NaN NaN
In [158]:
# using the isnull() function
df.isnull()
Out[158]:
Name Age
0 False False
1 True True

Now that we have found the missing values, the next task is to fill them with 0. This can be done with the fillna() method, as shown below:

In [160]:
df.fillna(0)
Out[160]:
Name Age
0 Tanu 23.0
1 0 0.0
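
Filling every missing value with 0 is not always sensible, for example in a column of names. As a sketch of two alternatives, the code below fills each column with its own value and, separately, drops the incomplete rows. The fill values are arbitrary choices for the illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Tanu', np.nan],
                   'Age': [23, np.nan]})

missing_per_column = df.isnull().sum()    # count missing values per column

# Fill each column with a different value (chosen only for this example)
filled = df.fillna({'Name': 'unknown', 'Age': df['Age'].mean()})

dropped = df.dropna()                     # remove rows that contain any NaN

print(missing_per_column)
print(filled)
print(dropped)
```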

Renaming the Columns or Indices of a DataFrame¶

To give the columns or the index values of your DataFrame different names, it’s best to use the .rename() method. The column names in the example below are deliberately misspelled so that the effect of renaming is easy to see.

In [161]:
# import the pandas library
import pandas as pd
# Dictionary of key pair values called data
data = {'NAMe':['Ashika', 'Tanu', 'Ashwin', 'Mohit', 'Sourabh'],
        'AGe': [24, 23, 22, 19, 10]}
data
Out[161]:
{'NAMe': ['Ashika', 'Tanu', 'Ashwin', 'Mohit', 'Sourabh'],
 'AGe': [24, 23, 22, 19, 10]}
In [162]:
# Calling the pandas data frame method by passing the dictionary (data) as a parameter
df = pd.DataFrame(data)
df
Out[162]:
NAMe AGe
0 Ashika 24
1 Tanu 23
2 Ashwin 22
3 Mohit 19
4 Sourabh 10
In [163]:
newcols = {
            'NAMe': 'Name',
            'AGe': 'Age'
          }
# Use `rename()` to rename your columns
df.rename(columns=newcols, inplace=True)
df
Out[163]:
Name Age
0 Ashika 24
1 Tanu 23
2 Ashwin 22
3 Mohit 19
4 Sourabh 10
In [164]:
# The values of new index
newindex = {
            0: 'a',
            1: 'b',
            2: 'c',
            3: 'd',
            4: 'e'
}
# Rename your index (without inplace=True this returns a new DataFrame and leaves df unchanged)
df.rename(index=newindex)
Out[164]:
Name Age
a Ashika 24
b Tanu 23
c Ashwin 22
d Mohit 19
e Sourabh 10

Read from and write to CSV file¶

Write to csv

In [176]:
# import the pandas library
import pandas as pd
# Dictionary of key pair values called data
data = {'Name':['Ashika', 'Tanu', 'Ashwin', 'Mohit', 'Sourabh'],
        'Age': [24, 23, 22, 19, 10]}

df = pd.DataFrame(data)
print(df)
# Write to csv
df.to_csv('myDataFrame.csv')

# If you don't want to include row index
df.to_csv('myDataFrame_noidx.csv', index=False)
      Name  Age
0   Ashika   24
1     Tanu   23
2   Ashwin   22
3    Mohit   19
4  Sourabh   10
In [177]:
# read from csv: with header=None the header row is read as data, and the saved row index appears as column 0
df = pd.read_csv('myDataFrame.csv', header=None, nrows=5)
print(df)
     0       1    2
0  NaN    Name  Age
1  0.0  Ashika   24
2  1.0    Tanu   23
3  2.0  Ashwin   22
4  3.0   Mohit   19
In [178]:
# when header=None, the column names are indices
df = pd.read_csv('myDataFrame_noidx.csv', header=None, nrows=5)
print(df)
        0    1
0    Name  Age
1  Ashika   24
2    Tanu   23
3  Ashwin   22
4   Mohit   19
In [179]:
# If you want the first row to be used as column names, explicitly set header=0
df = pd.read_csv('myDataFrame_noidx.csv', header=0, nrows=5)
print(df)
      Name  Age
0   Ashika   24
1     Tanu   23
2   Ashwin   22
3    Mohit   19
4  Sourabh   10
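
If the file was written with the row index included, you can tell read_csv() to use the first column as the index again. The sketch below shows this with index_col=0. To keep the example self-contained it writes to an in-memory buffer (io.StringIO) instead of a file, which is an assumption of this sketch and not how the cells above work.

```python
import io
import pandas as pd

df = pd.DataFrame({'Name': ['Ashika', 'Tanu'],
                   'Age': [24, 23]})

buffer = io.StringIO()      # in-memory stand-in for a CSV file
df.to_csv(buffer)           # the row index is written as the first column
buffer.seek(0)              # go back to the start before reading

restored = pd.read_csv(buffer, index_col=0)   # first column becomes the index
print(restored)
```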

Handle numeric exceptions¶

Sometimes there may be incorrect values in the dataframe:

In [180]:
import pandas as pd
import numpy as np

# There is an "A" in the "Rate" column which should be numeric value
data1 = {'id'   : [1,2,3,4,5],
         'Rate' : [5,9,3,'A',6],
         'Name' : ['a','b','c','d','e']}

df = pd.DataFrame(data1)

print(df)
   id Rate Name
0   1    5    a
1   2    9    b
2   3    3    c
3   4    A    d
4   5    6    e

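In this case, we can convert the column with pd.to_numeric(). With errors='coerce', any value that cannot be parsed as a number becomes NaN. The multiplication by 2 simply shows that the column now supports arithmetic: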

In [181]:
df['Rate'] = pd.to_numeric(df['Rate'], errors='coerce') * 2
print(df)
   id  Rate Name
0   1  10.0    a
1   2  18.0    b
2   3   6.0    c
3   4   NaN    d
4   5  12.0    e
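
After coercion, the invalid entries are NaN, so the tools for missing data apply. The following sketch, which is one possible follow-up rather than a required step, first lists the rows that failed to convert and then drops them.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'Rate': [5, 9, 3, 'A', 6]})

numeric_rate = pd.to_numeric(df['Rate'], errors='coerce')

bad_rows = df[numeric_rate.isna()]        # rows whose Rate could not be converted
print(bad_rows)

df['Rate'] = numeric_rate
clean = df.dropna(subset=['Rate'])        # keep only rows with a valid Rate
print(clean)
```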

Basic data processing with sklearn¶

Some resources:

  1. https://scikit-learn.org/stable/modules/preprocessing.html (official site, highly recommended if you want to learn more)
  2. https://pythonbasics.org/how-to-prepare-your-data-for-machine-learning-with-scikit-learn/

If you want to implement your learning algorithm with scikit-learn, the first thing you need to do is prepare your data.

Preparing the data exposes the structure of the problem to the learning algorithm you decide to use.

There are four common steps in preparing data for learning with scikit-learn:

  • Rescale the data
  • Standardization of data
  • Normalize the data
  • Turn data into binary

Rescale the data¶

When the attributes of your data have different scales, rescaling them to a common scale benefits many learning algorithms.

This process is often called normalization, and the rescaled attributes lie in the range 0 to 1. It is especially useful for optimization algorithms such as gradient descent, which form the core of many learning algorithms.
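
Rescaling to the range 0 to 1 is usually done per column with the formula (x - min) / (max - min). The sketch below applies that formula by hand with NumPy on a small made-up array, so that you can see what the scaler in the next cell computes; it is an illustration, not a replacement for the scikit-learn class.

```python
import numpy as np

# Hypothetical data: two attributes on different scales
X = np.array([[110, 200],
              [120, 800],
              [310, 400]], dtype=float)

col_min = X.min(axis=0)     # smallest value in each column
col_max = X.max(axis=0)     # largest value in each column

rescaled = (X - col_min) / (col_max - col_min)
print(rescaled)
```
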

In [1]:
import numpy
from sklearn.preprocessing import MinMaxScaler

# data values
X = [ [110, 200], [120, 800], [310, 400], [140, 900], [510, 200], [653, 400] ,[310, 880] ]

# transform data: scale each column to the range 0 to 1
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)

# summarize transformed data (first six rows, two decimal places)
numpy.set_printoptions(precision=2)
print(rescaledX[0:6,:])
[[0.   0.  ]
 [0.02 0.86]
 [0.37 0.29]
 [0.06 1.  ]
 [0.74 0.  ]
 [1.   0.29]]
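
The other three steps in the list above follow the same fit_transform() pattern. The following is a rough sketch using StandardScaler, Normalizer and Binarizer from sklearn.preprocessing; the data and the threshold of 300 are made up for the illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, Normalizer, Binarizer

# Hypothetical data values
X = np.array([[110, 200], [120, 800], [310, 400], [140, 900]], dtype=float)

# Standardize: each column gets mean 0 and standard deviation 1
standardized = StandardScaler().fit_transform(X)

# Normalize: each row is scaled to unit length (L2 norm of 1)
normalized = Normalizer().fit_transform(X)

# Binarize: values above the threshold become 1, all others 0
binary = Binarizer(threshold=300).fit_transform(X)

np.set_printoptions(precision=2)
print(standardized)
print(normalized)
print(binary)
```

Note that standardization works column by column, whereas normalization works row by row.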