Tuesday, December 12, 2017

Pandas Library Lesson - 3

Lesson 3



These tutorials are also available through an email course, please visit http://www.hedaro.com/pandas-tutorial to sign up today.
Get Data - Our data set will consist of an Excel file containing customer counts per date. We will learn how to read in the Excel file for processing.
Prepare Data - The data is an irregular time series having duplicate dates. We will be challenged in compressing the data and coming up with next year's forecasted customer count.
Analyze Data - We use graphs to visualize trends and spot outliers. Some built-in computational tools will be used to calculate next year's forecasted customer count.
Present Data - The results will be plotted.
NOTE: Make sure you have looked through all previous lessons, as the knowledge learned in previous lessons will be needed for this exercise.
In [1]:
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy.random as np   # note: 'np' here is an alias for numpy.random, not numpy itself
import sys
import matplotlib

%matplotlib inline
In [2]:
print('Python version ' + sys.version)
print('Pandas version: ' + pd.__version__)
print('Matplotlib version ' + matplotlib.__version__)
Python version 3.5.1 |Anaconda custom (64-bit)| (default, Feb 16 2016, 09:49:46) [MSC v.1900 64 bit (AMD64)]
Pandas version: 0.20.1
Matplotlib version 1.5.1
We will be creating our own test data for analysis.
In [3]:
# set seed
np.seed(111)

# Function to generate test data
def CreateDataSet(Number=1):
    
    Output = []
    
    for i in range(Number):
        
        # Create a weekly (mondays) date range
        rng = pd.date_range(start='1/1/2009', end='12/31/2012', freq='W-MON')
        
        # Create random data
        data = np.randint(low=25,high=1000,size=len(rng))
        
        # Status pool
        status = [1,2,3]
        
        # Make a random list of statuses
        random_status = [status[np.randint(low=0,high=len(status))] for i in range(len(rng))]
        
        # State pool
        states = ['GA','FL','fl','NY','NJ','TX']
        
        # Make a random list of states 
        random_states = [states[np.randint(low=0,high=len(states))] for i in range(len(rng))]
    
        Output.extend(zip(random_states, random_status, data, rng))
        
    return Output
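As an aside, on a recent numpy the same generator can be written with the `np.random.default_rng` Generator API, which avoids aliasing `numpy.random` as `np`. This is a sketch, not part of the original lesson; the function name `create_dataset` and the `seed` argument are my own:

```python
import pandas as pd
import numpy as np

def create_dataset(number=1, seed=111):
    """Generate the same shape of test data using numpy's Generator API (numpy >= 1.17)."""
    gen = np.random.default_rng(seed)
    output = []
    for _ in range(number):
        # Weekly (Mondays) date range, as in the original function
        dates = pd.date_range(start='1/1/2009', end='12/31/2012', freq='W-MON')
        # Random customer counts in [25, 999]; high is exclusive for Generator.integers
        data = gen.integers(low=25, high=1000, size=len(dates))
        # Random statuses and states drawn from the same pools as the original
        statuses = gen.choice([1, 2, 3], size=len(dates))
        states = gen.choice(['GA', 'FL', 'fl', 'NY', 'NJ', 'TX'], size=len(dates))
        output.extend(zip(states, statuses, data, dates))
    return output
```

The draws come out in a different order than the legacy `np.seed`/`np.randint` calls, so the exact values will differ from the notebook output, but the shape of the data is the same.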
Now that we have a function to generate our test data, let's create some data and stick it into a dataframe.
In [4]:
dataset = CreateDataSet(4)
df = pd.DataFrame(data=dataset, columns=['State','Status','CustomerCount','StatusDate'])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 836 entries, 0 to 835
Data columns (total 4 columns):
State            836 non-null object
Status           836 non-null int64
CustomerCount    836 non-null int64
StatusDate       836 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 26.2+ KB
In [5]:
df.head()
Out[5]:
   State  Status  CustomerCount StatusDate
0     GA       1            877 2009-01-05
1     FL       1            901 2009-01-12
2     fl       3            749 2009-01-19
3     FL       3            111 2009-01-26
4     GA       1            300 2009-02-02
We are now going to save this dataframe to an Excel file and then read it back into a dataframe. We do this simply to show you how to read and write Excel files.
We do not write the dataframe's index values to the Excel file, since they are not meant to be part of our initial test data set.
In [6]:
# Save results to Excel
df.to_excel('Lesson3.xlsx', index=False)
print('Done')
Done

Grab Data from Excel

We will be using the read_excel function to read in data from an Excel file. The function allows you to read in specific tabs by name or location.
In [7]:
pd.read_excel?
Note: The Excel file is assumed to be in the same folder as the notebook, unless specified otherwise.
In [8]:
# Location of file
Location = r'C:\Users\david\notebooks\update\Lesson3.xlsx'

# Parse a specific sheet
df = pd.read_excel(Location, 0, index_col='StatusDate')
df.dtypes
Out[8]:
State            object
Status            int64
CustomerCount     int64
dtype: object
In [9]:
df.index
Out[9]:
DatetimeIndex(['2009-01-05', '2009-01-12', '2009-01-19', '2009-01-26',
               '2009-02-02', '2009-02-09', '2009-02-16', '2009-02-23',
               '2009-03-02', '2009-03-09',
               ...
               '2012-10-29', '2012-11-05', '2012-11-12', '2012-11-19',
               '2012-11-26', '2012-12-03', '2012-12-10', '2012-12-17',
               '2012-12-24', '2012-12-31'],
              dtype='datetime64[ns]', name='StatusDate', length=836, freq=None)
In [10]:
df.head()
Out[10]:
            State  Status  CustomerCount
StatusDate
2009-01-05     GA       1            877
2009-01-12     FL       1            901
2009-01-19     fl       3            749
2009-01-26     FL       3            111
2009-02-02     GA       1            300
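The write-then-read cycle above can be sketched end to end on a tiny frame. This is an illustration only: it needs an Excel engine such as openpyxl or xlsxwriter installed, and the `demo.xlsx` file name and sample values are arbitrary:

```python
import os
import tempfile

import pandas as pd

# A tiny frame standing in for df
small = pd.DataFrame({'State': ['GA', 'FL'], 'CustomerCount': [877, 901]})

# Write without the index (as in the lesson), then read it back
path = os.path.join(tempfile.mkdtemp(), 'demo.xlsx')
small.to_excel(path, index=False)
back = pd.read_excel(path)

print(back.equals(small))
```

Because we wrote with `index=False`, the round trip preserves both the values and the integer/object dtypes of the original frame.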

Prepare Data

This section attempts to clean up the data for analysis.
  1. Make sure the state column is all in upper case
  2. Only select records where the account status is equal to "1"
  3. Merge (NJ and NY) to NY in the state column
  4. Remove any outliers (any odd results in the data set)
Let's take a quick look at how some of the State values are upper case and some are lower case.
In [11]:
df['State'].unique()
Out[11]:
array(['GA', 'FL', 'fl', 'TX', 'NY', 'NJ'], dtype=object)
To convert all the State values to upper case we will use the upper() string method and the dataframe's apply method. The lambda function will simply apply upper() to each value in the State column.
In [12]:
# Clean State Column, convert to upper case
df['State'] = df.State.apply(lambda x: x.upper())
In [13]:
df['State'].unique()
Out[13]:
array(['GA', 'FL', 'TX', 'NY', 'NJ'], dtype=object)
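As an aside, the same clean-up is usually written with the vectorized `.str` accessor instead of `apply`; a minimal sketch on a toy frame (the sample values are made up):

```python
import pandas as pd

toy = pd.DataFrame({'State': ['GA', 'fl', 'FL', 'tx']})

# .str.upper() applies the string method to every value in the column
toy['State'] = toy['State'].str.upper()
print(toy['State'].tolist())  # ['GA', 'FL', 'FL', 'TX']
```

Unlike a plain `apply(lambda x: x.upper())`, the `.str` accessor also passes missing values (NaN) through without raising.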
In [14]:
# Only grab where Status == 1
mask = df['Status'] == 1
df = df[mask]
To turn the NJ states to NY we simply...
df.State == 'NJ' - Find all records in the State column that are equal to NJ.
df.loc[df.State == 'NJ', 'State'] = 'NY' - For all records in the State column equal to NJ, replace them with NY. (The original lesson used chained indexing, df.State[df.State == 'NJ'] = 'NY', which raises a SettingWithCopyWarning on newer pandas.)
In [15]:
# Convert NJ to NY
mask = df.State == 'NJ'
df.loc[mask, 'State'] = 'NY'
Now we can see we have a much cleaner data set to work with.
In [16]:
df['State'].unique()
Out[16]:
array(['GA', 'FL', 'NY', 'TX'], dtype=object)
At this point we may want to graph the data to check for any outliers or inconsistencies. We will be using the plot() method of the dataframe.
As you can see from the graph below, it is not very conclusive, which is probably a sign that we need to perform some more data preparation.
In [17]:
df['CustomerCount'].plot(figsize=(15,5));
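One way to make the outlier check numeric rather than visual (not part of the original lesson) is to keep only values inside chosen quantiles; a sketch on toy data, where the 5%/95% cutoffs are an arbitrary choice:

```python
import pandas as pd

counts = pd.Series([100, 120, 110, 5000, 130, 90])

# Keep only values between the 5th and 95th percentiles
lo, hi = counts.quantile(0.05), counts.quantile(0.95)
trimmed = counts[counts.between(lo, hi)]

print(trimmed.tolist())  # [100, 120, 110, 130]
```

Here the extreme value 5000 (and the low value 90) fall outside the cutoffs and are dropped; on real data you would pick the quantile bounds to match how aggressive you want the trimming to be.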
