Pandas: Getting started with data analytics

Let’s get started

gaurav kumar
3 min read · Mar 4, 2021

Hello everyone! I want to discuss pandas, one of the most important and widely used Python libraries. Though there are a lot of stories available on it, the importance of this particular library got me to write this one. This is the library where the journey into data science begins.

As more and more people get into data science and machine learning, it is very important for them to have the right tools.

Pandas is an open-source, BSD-licensed Python library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Importing Pandas

It is generally imported together with another important Python library, NumPy:

import pandas as pd
import numpy as np

Reading and writing data

Using pandas we can work with various file formats, for example:

CSV files
Excel files
JSON files
SQL databases
HTML files
Pickle files

Most of the time we use only two types of file: either a CSV (comma-separated values) file or an Excel file.

Let’s see how to read and write these two formats, along with JSON.

1. CSV

read_csv is used to read files, while to_csv is used to write files.

df = pd.read_csv('data.csv')  # path of the CSV file to read
df.to_csv('output.csv')       # 'output.csv' is a placeholder output path

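2. Excel

Similar to CSV, read_excel is used to read files, while to_excel is used to write files.

df = pd.read_excel('data.xlsx')  # 'data.xlsx' is a placeholder input path
df.to_excel('output.xlsx')       # 'output.xlsx' is a placeholder output path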

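3. JSON

df = pd.read_json('data.json')  # 'data.json' is a placeholder input path
df.to_json('output.json')       # 'output.json' is a placeholder output path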
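
To make the read and write calls above concrete, here is a minimal sketch of a round trip through CSV and JSON. The table contents, the column names NAME and AGE, and the file names are made up for illustration, and a temporary directory is used so nothing is left on disk. Excel is left out because it needs an extra engine package to be installed.

```python
import os
import tempfile

import pandas as pd

# A small made-up table; the column names are illustrative only.
df = pd.DataFrame({"NAME": ["Asha", "Ben", "Chen"], "AGE": [8, 12, 9]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "people.csv")
    json_path = os.path.join(tmp, "people.json")

    # index=False keeps the row index out of the CSV file.
    df.to_csv(csv_path, index=False)
    df_from_csv = pd.read_csv(csv_path)

    # orient="records" writes one JSON object per row.
    df.to_json(json_path, orient="records")
    df_from_json = pd.read_json(json_path)

print(df_from_csv)
print(df_from_json)
```

Passing index=False is a common choice when the index is just the default row counter; without it, to_csv writes the index as an extra unnamed column.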

Data structures

One of the most important things in pandas is to understand its data structures. Pandas has historically offered three, distinguished by their dimensionality:

  • Series
  • DataFrame
  • Panel

A Series is 1D, a DataFrame is 2D and a Panel is 3D. Panel was deprecated and has been removed from recent pandas releases, so in practice you will work with Series and DataFrame. Of these, the most frequently used data structure is the DataFrame. Let’s see how we can create a DataFrame.

pd.DataFrame(data, index, columns, dtype, copy)
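
As a minimal sketch of the constructor above, the snippet below builds a Series and a DataFrame from plain Python values. The names, values and index labels are assumptions chosen for illustration.

```python
import pandas as pd

# A Series is a one-dimensional labelled array.
ages = pd.Series([8, 12, 9], index=["a", "b", "c"], name="AGE")

# A DataFrame is a two-dimensional table; here it is built from a dict
# that maps column names to lists of values.
data = pd.DataFrame(
    {"NAME": ["Asha", "Ben", "Chen"], "AGE": [8, 12, 9]},
    index=["a", "b", "c"],
)

print(ages.ndim)   # number of dimensions of the Series
print(data.shape)  # (rows, columns) of the DataFrame

# Selecting a single column of a DataFrame gives back a Series.
print(type(data["AGE"]))
```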

Cleaning data

Data cleaning is the process of fixing or removing:

  1. Null values
  2. Incorrect data
  3. Incorrectly formatted data
  4. Duplicate data

and so on. Pandas provides useful functions to tackle these problems. For example, to remove rows with empty cells or null values, we have the dropna function.

df = pd.read_csv('data.csv')

df_without_null = df.dropna()

Similarly, to remove duplicates:

df.drop_duplicates(inplace = True)
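
The two cleaning steps can be seen together on a small in-memory table. This is only a sketch: the data is made up, and it uses the returned copies instead of inplace=True so the original frame stays available for comparison.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing AGE and one fully duplicated row.
df = pd.DataFrame({
    "NAME": ["Asha", "Ben", "Ben", "Chen"],
    "AGE": [8, 12, 12, np.nan],
})

# dropna() returns a new DataFrame without the rows that contain nulls.
df_without_null = df.dropna()

# drop_duplicates() returns a new DataFrame keeping the first copy of
# each repeated row.
df_clean = df_without_null.drop_duplicates()

print(len(df), len(df_without_null), len(df_clean))
print(df_clean)
```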

Filtering and Grouping data

We often need to filter data to get at the rows we want, or to group it to compute summaries. For this, pandas has very handy functions such as groupby() and filter().

Let’s understand this by example:
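
A minimal sketch of groupby() on a small made-up table follows. The columns CITY and AGE and their values are assumptions for illustration, not data from this story.

```python
import pandas as pd

# Made-up data: each row is a person with a city and an age.
data = pd.DataFrame({
    "CITY": ["Pune", "Delhi", "Pune", "Delhi", "Pune"],
    "AGE": [8, 12, 9, 30, 25],
})

# Split the rows by CITY, then compute the mean AGE within each group.
mean_age = data.groupby("CITY")["AGE"].mean()

# size() counts how many rows fall into each group.
counts = data.groupby("CITY").size()

print(mean_age)
print(counts)
```

The result is indexed by the grouping column, so mean_age["Pune"] gives the average age for that city.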

I hope it’s clear from the example what groupby() does.

Data can be filtered as required. For example, here we keep only the rows with an age less than 10:

filtered_data = data[data['AGE'] < 10]
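
It helps to see the two steps separately: the comparison on its own produces a boolean Series (a mask), and indexing the DataFrame with that mask keeps only the matching rows. A minimal sketch with made-up names and ages:

```python
import pandas as pd

# Made-up data for illustration.
data = pd.DataFrame({"NAME": ["Asha", "Ben", "Chen"], "AGE": [8, 12, 9]})

# The comparison yields one True/False value per row.
mask = data["AGE"] < 10

# Indexing with the mask keeps only the rows where it is True.
filtered_data = data[mask]

print(mask)
print(filtered_data)
```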

Merging Data Frames

Sometimes we need to merge two data sets to form a single one. This concept is similar to a join in a relational database. Pandas takes care of this with a built-in function, merge. Let’s see how the merge function is defined:

DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
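
As a minimal sketch of how the on and how parameters behave, the snippet below merges two small made-up tables on a shared ID column. The table contents and column names are assumptions for illustration.

```python
import pandas as pd

# Two made-up tables that share an ID column.
left = pd.DataFrame({"ID": [1, 2, 3], "NAME": ["Asha", "Ben", "Chen"]})
right = pd.DataFrame({"ID": [2, 3, 4], "CITY": ["Delhi", "Pune", "Goa"]})

# how="inner" keeps only the IDs present in both tables.
inner = left.merge(right, on="ID", how="inner")

# how="left" keeps every row of the left table; CITY is missing (NaN)
# where the right table has no matching ID.
left_join = left.merge(right, on="ID", how="left")

print(inner)
print(left_join)
```

The other accepted values for how, such as "right" and "outer", follow the same naming as SQL joins.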

I have tried to give an overview of pandas, and I hope you all find it useful. If you like this story or find it useful, do let me know by clapping.

Please let me know in the comments if you find any mistakes or want me to add anything.

Thank you.
