
5 data formatting tasks from my last Data Science Project

Whether you’re a Data Analyst, a Data Engineer, or a Data Scientist, there’s just no running away from data cleaning. It’s an essential skill if you want to build a career in data. I am writing this blog to share 5 unique data formatting challenges that I faced in my last project, working for a client.

Tip: If you are preparing for a Python interview, these might make good practice problems.

  1. Creating tables from nested JSON files.
  2. Extracting the month name directly from the month number.
  3. Masking confidential columns in your data.
  4. Concatenating Excel sheets with sheet names as aliases for identification.
  5. Formatting a complex list to a required dataframe.

1. Creating tables from nested JSON files

This is the sample JSON object that we need to start with.
My first impression while staring at this problem was that there must be a method in pandas that can directly read JSON files, and thankfully that assumption came true. But let’s analyze this JSON a little further, as it appears to be nested.
After a little exploration I could understand the structure a little better.
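The sample image did not survive in this post, so here is a hypothetical test.json of the shape the code below assumes. The field names (details, emp_id, project) are my guesses, not the actual client data:

```python
import json

# Hypothetical nested structure: the first field holds a list of detail
# records per employee, the remaining fields form the flat master record.
sample = [
    {"details": [{"emp_id": 1, "project": "Alpha"},
                 {"emp_id": 1, "project": "Beta"}],
     "Id": 1, "Name": "John"},
    {"details": [{"emp_id": 2, "project": "Gamma"}],
     "Id": 2, "Name": "Jane"},
]

# Write it to disk so pd.read_json('test.json') has something to load
with open("test.json", "w") as f:
    json.dump(sample, f, indent=2)
```

Note the mapping between the nested records and the flat fields (emp_id = 1 ↔ Id = 1), which is what the rest of this section relies on.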

Next, let us hop on to a Jupyter notebook and directly fire a query that reads the JSON file.
import pandas as pd

df = pd.read_json('test.json')
df.head(3)
Executing the above code produces the output below.

Clearly, we can see there’s a clean table barring the 1st column. And if you observe closely, you will also find a mapping between the 1st column and the rest of the table [emp_id = 1 & Id = 1]. At this point, I was convinced that these might be stored as two tables: one serving as a master for the employees and the other (the 1st column) holding detailed information about them.

# Only working on the 1st column
d1 = pd.read_json('test.json').iloc[:, 0]

# Creating an empty dataframe
df = pd.DataFrame()

# Looping through each sequence and concatenating them
for i in d1:
    df1 = pd.DataFrame(i)
    df = pd.concat([df, df1])
df

The above code, upon execution, shows a nice clean table which goes as follows.

That completes the first table, which was contained entirely in the 1st column. The code for extracting the master employee table goes down below.

# Skipping the first column and saving the rest of the table
d2 = pd.read_json('test.json').iloc[:, 1:]

For beginners, I would like to point out that this is just one small task, and for carrying out these kinds of specific tasks we should ideally develop a habit of packaging our code by writing functions. Down below is a small example of how to piece everything together while also avoiding hard-coding.

This might be less readable. The roadmap to get here is to write simple one-liners first and then bundle them together in a function like this.
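Since the original example image did not survive, here is a minimal sketch of how the steps above might be bundled into a function. The function name and the nested_col parameter are my own assumed choices; the path and nested column are passed in rather than hard-coded:

```python
import pandas as pd

def split_nested_json(path, nested_col="details"):
    """Read a nested JSON file and return (detail_table, master_table).

    `nested_col` is the name of the column holding the nested records.
    """
    raw = pd.read_json(path)
    # Flatten the nested column into one detail table
    detail = pd.concat(
        [pd.DataFrame(rec) for rec in raw[nested_col]],
        ignore_index=True,
    )
    # Everything except the nested column forms the master table
    master = raw.drop(columns=[nested_col])
    return detail, master
```

Calling split_nested_json('test.json') then yields both tables in one shot, for any file with the same shape.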
2. Extracting the month name directly from the month number.

This is the task.
This is the easiest of all five tasks, but only if you know the right library.

import calendar

df['Month_Name'] = df['Month'].apply(lambda x: calendar.month_abbr[int(x)])

The above code should be enough if your ‘Month’ column does not contain missing values, which was not the case for me. So I had to write code to handle that. Down below is the complete version.

import numpy as np

df['Month_Name'] = df['Month'].fillna(0).apply(lambda x: calendar.month_abbr[int(x)] if x != 0 else np.nan)

calendar.month_abbr[int(x)] would throw an exception if it encountered a missing value, hence we need to impute it first and then write an else condition to restore it to NaN.
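Putting the pieces together, a self-contained sketch with toy data (the month values here are made up) that exercises both the normal path and the missing-value branch:

```python
import calendar
import numpy as np
import pandas as pd

# Toy data with a missing month to exercise the NaN branch
df = pd.DataFrame({"Month": [1, 6, None, 12]})

# fillna(0) keeps int(x) from failing; 0 is then mapped back to NaN
df["Month_Name"] = df["Month"].fillna(0).apply(
    lambda x: calendar.month_abbr[int(x)] if x != 0 else np.nan
)
print(df)
```

calendar.month_abbr is a 13-element sequence ('' at index 0, then 'Jan' … 'Dec'), which is why the sentinel 0 conveniently never collides with a real month.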

3. Masking the confidential columns in your data.

Before jumping into the code, I want to talk about the reason behind masking. When you work with consulting firms, chances are you might have to deal with third parties who may or may not have signed an NDA. For people hearing of NDAs for the first time — Non-Disclosure Agreements are an important legal framework used to protect sensitive and confidential information from being disclosed by the recipient of that information.
I was in a situation where we were dealing with sample data and it involved sending it to a third party.
Now let’s jump to the code. Here’s the sample data.

find the data at : Data
Chances are you might have never heard of the package called ‘cape-privacy’.
So I would recommend you install it right away if you wish to try the same exercise at your end.

# pip install cape-privacy

Now, this library has an array of methods. It generally aims at providing data privacy through data perturbation. I will link some good resources right at the end of this blog where you can explore more about this package.
We want to import the transformations class from this module.

from cape_privacy.pandas import transformations as tfms

This class has a lot of methods associated with it. We can check them by firing the query below.

tfms??

This will take you to the below output.

You can check the documentation of each one of them in detail by further querying the specific method. I’m going to use Tokenizer; let’s read and understand the documentation, as we are using it for the very first time.

This is just a part of the documentation. When you execute this you will see details of the attributes and examples associated with the method.
OK. So, the example tells us that we need to specify a parameter called ‘max_token_len’, which tells Python the length of the masked data. The other parameter, ‘secret’, allows the masked data to be reversed to the original values when the secret key is known, which we really don’t need as of now.
And here goes the final execution.

tokenize_name = tfms.Tokenizer(max_token_len=20)
df['Name'] = tokenize_name(df.Name)
df

And we have our masked data.

Phew!! Now I can safely ship the data to the third party.

4. Concatenating excel sheets with sheet names as alias for identification.

If the heading does not make sense to you, just ignore it and look at the image down below.

We have data containing multiple sheets, and we need to combine all of them together with the sheet name as a new column for identification.
In the actual scenario I had more than 30 sheets, and hard-coding the sheet names was simply not an option. I have been using pandas for 4+ years, yet I was never aware of a method which directly fetches the sheet names.
This can be achieved with a simple two-liner.

path = 'Football_Data.xlsx'
sheetnames = pd.ExcelFile(path).sheet_names
sheetnames

Output: ['Premier League', 'Serie A', 'La Liga']
Leveraging this output, we are going to run a loop and add sheets one after another using pandas.concat, as we saw in the 1st task of this blog.

df = pd.DataFrame()
for name in sheetnames:
    df1 = pd.read_excel(path, sheet_name=name)
    df1['Type'] = name
    df = pd.concat([df, df1])
df.reset_index(drop=True, inplace=True)

The resultant dataframe.

I never thought it would be as simple as this.
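As a side note, pd.read_excel(path, sheet_name=None) returns a dict mapping each sheet name to its DataFrame, so the loop can also be collapsed into a single pd.concat over that dict. A sketch with an in-memory stand-in for that dict (the sheet contents here are made up):

```python
import pandas as pd

# Stand-in for what pd.read_excel(path, sheet_name=None) would return:
# a dict of {sheet_name: DataFrame}
sheets = {
    "Premier League": pd.DataFrame({"Team": ["Arsenal"]}),
    "Serie A": pd.DataFrame({"Team": ["Milan"]}),
}

# concat on a dict uses the keys as an extra index level;
# naming that level and resetting it turns it into the 'Type' column
combined = (
    pd.concat(sheets, names=["Type", None])
    .reset_index(level=0)
    .reset_index(drop=True)
)
print(combined)
```

The loop version in the post is arguably easier to read; this variant mainly saves the explicit ExcelFile step.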

5. Formatting a complex list to a required dataframe.

The question goes something like this
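Since the problem statement was an image, here is a hypothetical reconstruction of the input, inferred from the code that follows (the actual values were different): a flat list of dicts, one per (shift, day) pair, with 7 days per shift.

```python
# Hypothetical reconstruction of the input list: each dict carries the
# shift number, the day index (0 = Monday .. 6 = Sunday), and a 0/1 value.
availability = [
    {"shift": s, "day": d, "value": (s + d) % 2}
    for s in range(1, 4)   # shifts 1..3
    for d in range(7)      # days Monday..Sunday
]
```

With 3 shifts × 7 days this gives 21 rows, and slicing every 7 rows isolates one shift, which is exactly what the code below exploits.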

At this point you can pause reading the blog and take up this challenge. I was interviewing someone for a senior Data Engineer role and gave this problem to test the candidate’s Python skills. So if you are able to solve this in 15 minutes (yes, you can Google) you can consider yourself to be in a very good place, and I would rate you 4.5/5 in Python.
The way I approach these kinds of problems is by starting from an absolute baseline and firing the below query blindly.

#sliced for the first shift by looking at the target dataframe
pd.DataFrame(availability[0:7])

We are nowhere near the output dataframe, but one realization came to my mind: we might have to go for a transposed version of this dataframe.

Now compare this intermediate dataframe to the final resultant dataframe. You will realize that the only thing required from this data is the last row, and the index needs to be renamed from ‘value’ to ‘shift 1’.

So, at this point, I need to:

  1. Get rid of the rest of the rows.
  2. Rename ‘value’ to ‘shift 1’ before transposing.
  3. Get rid of the shift and day values.

Let’s do that and see the results

pd.DataFrame(availability[0:7]).rename(columns={'value': 'shift' + '1'}).iloc[:, 2:].T

Now we are somewhat close to our resultant dataframe. The next two tasks are:

  1. Convert the 0, 1, 2, … to Monday, Tuesday, Wednesday, …
  2. Run a loop over all the shifts by slicing every 7 rows from the starting dataframe.

Step 1 goes as follows

import calendar

df = pd.DataFrame(availability[0:7]).rename(columns={'value': 'shift' + '1'}).iloc[:, 2:].T
df.columns = list(map(lambda x: calendar.day_name[int(x)], df.columns))
df

Does the above code ring any bells? 🙂

Now, putting it all together.

df = pd.DataFrame()
for i in range(0, len(availability), 7):
    df = pd.concat([df, pd.DataFrame(availability[i:i+7]).iloc[:, 2:].T])

df.index = list(map(lambda x: 'Shift ' + str(x + 1), df.reset_index(drop=True).index.values))
df.columns = list(map(lambda x: calendar.day_name[int(x)], df.columns))
df

Resultant DataFrame

Conclusion

These tasks might not appear so difficult, as we have libraries in place that do the job for us, but if you are a beginner or an intermediate in Python, chances are you might have never used some of them before.

Thanks for reading.


Team Building

Objective

This training program will help develop a shared vision/understanding for a high-performing team, determining the critical elements and individual contributions that comprise the vision in their organizations. It will also promote practice of the key skills needed to address the inevitable challenges that arise in teams, notably appreciating individual differences, communicating collaboratively, and managing conflict.

Training Methodology

A training program with scenario-based case studies, activities & debriefing.

Key Takeaways

1. Be able to describe:

• Characteristics of effective teams

• Four stages of team development

• How individual differences and roles contribute to building a strong team

2. Have practiced skills in:

• Supporting team development through its formative stages

• Constructive communication

• Conflict resolution

3. And will have discussed:

• A vision of their ideal team

• Principles and behaviors to guide team performance

• A plan for monitoring progress toward achieving their vision

