Technical – People Decode

The concept of Fuzzy matching in Python

master — Tue, 08 Mar 2022 09:44:47 +0000

Ever wondered how the spell checkers and auto-correct on your mobile phone bail you out by suggesting the word you were just about to type?

I’ve come across many amazing functionalities in Python, but the fuzzywuzzy package in particular caught my interest. So I decided to write a blog about it and share it with a wider network.

Below are the subtopics of this blog:

Levenshtein Distance

Fuzzywuzzy Package

Some Examples

End Notes

Levenshtein Distance

Levenshtein Distance, sometimes also called minimum edit distance, is a popular metric for measuring the distance between two strings. It is calculated by counting the minimum number of single-character edits required to transform one string into the other. Each edit is one of the following:

Addition of a new letter.
Removal of a letter.
Substitution of one letter for another.

The mathematical formula behind the calculation looks something like this.
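The figure from the original post is not reproduced here. For reference, the standard recursive definition of the Levenshtein distance between the first i characters of a string a and the first j characters of a string b is usually written as follows (the symbols a, b, i and j are the conventional ones and are an assumption here, not taken from the original figure):

```latex
\operatorname{lev}_{a,b}(i,j) =
\begin{cases}
  \max(i,j) & \text{if } \min(i,j) = 0, \\[4pt]
  \min
  \begin{cases}
    \operatorname{lev}_{a,b}(i-1,j) + 1 \\
    \operatorname{lev}_{a,b}(i,j-1) + 1 \\
    \operatorname{lev}_{a,b}(i-1,j-1) + 1_{(a_i \neq b_j)}
  \end{cases} & \text{otherwise.}
\end{cases}
```

The three inner terms correspond to a removal, an addition and a substitution; the indicator term is 0 when the two letters already match.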

The formula above contains too many notations and subscripts to wrap one’s head around, so I will quickly explain it with a simple demonstration. Consider the two strings ‘Shallow’ and ‘Follow’.

The following three simple edits change the first string into the second.

Edit 1 : Remove the 1st letter ‘S’.

Edit 2 : Substitute ‘h’ with ‘F’.

Edit 3 : Substitute ‘a’ with ‘o’.

And there we have it: the Levenshtein distance between the two strings ‘Shallow’ and ‘Follow’ is 3.

Now, there could be multiple ways of transforming one word into another, but the Levenshtein distance is the length of the shortest possible path.
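As a minimal sketch of the idea (this is not the notebook mentioned in this post), the shortest path can be found with the classic dynamic-programming table. The function name `levenshtein` is an illustrative choice.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of additions, removals and substitutions."""
    # previous[j] = distance between the first i-1 letters of a and the first j letters of b
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning i letters into an empty string takes i removals
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # remove char_a
                current[j - 1] + 1,      # add char_b
                previous[j - 1] + cost,  # substitute, or keep a matching letter
            ))
        previous = current
    return previous[-1]


print(levenshtein("Shallow", "Follow"))  # 3
```

Only two rows of the table are kept in memory at a time, which is enough because each cell depends only on the current and the previous row.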

I have written code to calculate the Levenshtein Distance for any two given strings, and I highly recommend going through it. You can find the link to my notebook below. Feel free to try out other combinations, and do let me know in the comment section if you find this helpful.

Levenshtein Distance Calculation in Python : Link

Also, to understand the calculation in more detail and to make sense of the bizarre mathematical formula shown above, do watch this video. It is very well explained and very intuitive, and I highly recommend going through it once.

Fuzzywuzzy Package

The idea of fuzzy matching is to calculate the similarity between any two given strings. This is achieved by making use of the Levenshtein Distance between the two strings.

fuzzywuzzy is a third-party Python package (it is not part of the standard library) whose functions do all this calculation for us. I’m going to discuss four of them:

fuzz.ratio()

fuzz.partial_ratio()

fuzz.token_sort_ratio()

fuzz.token_set_ratio()

Note: To use this package you first have to install it. Here’s how you do it.

pip install fuzzywuzzy

You may need to restart your kernel after installation.

Examples

ratio()

from fuzzywuzzy import fuzz
fuzz.ratio('My name is Sreemanta', 'My name is Sreemanta Kesh')

Output : 89

The output indicates that the two strings are 89% similar. Let us try some more examples.

print(fuzz.ratio('My name is Sreemanta', 'My name is Sreemanta '))
print(fuzz.ratio('My name is Sreemanta', 'My name is Sreemanta'))

Output : 98

Output : 100

fuzz.ratio() returns 100 only when it finds an exact match. The 2nd example is an exact match, whereas the 1st one differs by a trailing space.

fuzz.ratio('My name is Sreemanta', 'Sreemanta name is My')

Output : 45

This tells us that the order of the words also matters when comparing.

partial_ratio()

Let’s run the same examples using the partial_ratio() method.

print(fuzz.partial_ratio('My name is Sreemanta', 'My name is Sreemanta Kesh'))
print(fuzz.partial_ratio('My name is Sreemanta', 'My name is Sreemanta '))

Output : 100

Output : 100

Notice that the strings are not identical in either case, yet both give a 100% match. This is because partial_ratio() compares the shorter string against the best-matching substring of the longer one, so it returns 100 whenever one string is a substring of the other. Below is another example to confirm this.

fuzz.partial_ratio('New York City', 'New York')

Output : 100

token_sort_ratio()

fuzz.token_sort_ratio('My name is Sreemanta', 'Sreemanta name is My ')

Output : 100

This result is sharply different from the one observed when the same strings were compared with the ratio() method, which yielded a 45% match.

fuzz.token_sort_ratio('My name is Sreemanta', 'sreemanta name is My ** ')

Output : 100

From the above two examples we can conclude that

The order of the words does not matter.
It also ignores punctuation and letter case.

It follows the concept of tokenization: the strings are split into tokens, the tokens are sorted in alphabetical order, and only then does the comparison happen.
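The preprocessing described above can be sketched in a few lines of plain Python. This is an illustration of the idea under my own assumptions, not fuzzywuzzy's actual code; `sorted_tokens` is a hypothetical helper name.

```python
import re


def sorted_tokens(text: str) -> str:
    """Lowercase, turn non-alphanumeric characters into spaces, then sort the tokens."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return " ".join(sorted(cleaned.split()))


a = sorted_tokens("My name is Sreemanta")
b = sorted_tokens("sreemanta name is My ** ")
print(a)       # is my name sreemanta
print(a == b)  # True
```

Once both strings are reduced to the same sorted token string, a plain character-level comparison reports a perfect match.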

token_set_ratio()

print(fuzz.token_sort_ratio('My name is Sreemanta', 'Sreemanta name is My Kesh '))
print(fuzz.token_set_ratio('My name is Sreemanta', 'Sreemanta name is My Kesh '))

Output : 89

Output : 100

token_set_ratio() takes a more flexible approach than token_sort_ratio(). Instead of just tokenizing the strings, sorting the tokens and pasting them back together, token_set_ratio() performs a set operation that separates the common tokens (the intersection) from the remaining ones and then applies fuzz.ratio() to the resulting strings. Extra or repeated words do not matter.

fuzz.token_set_ratio('Jingle Bells', ' Bells Jingle Bells')

Output : 100
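To make the set operation concrete, here is a rough sketch of the approach as I understand it. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for fuzz.ratio(), and the helper names `simple_ratio` and `token_set_sketch` are hypothetical, so treat it as an illustration rather than the package's implementation.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    # Stand-in for fuzz.ratio(): similarity on a 0-100 scale
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_set_sketch(s1: str, s2: str) -> int:
    tokens1, tokens2 = set(s1.lower().split()), set(s2.lower().split())
    common = " ".join(sorted(tokens1 & tokens2))
    # Common tokens followed by the tokens unique to each string
    combined1 = (common + " " + " ".join(sorted(tokens1 - tokens2))).strip()
    combined2 = (common + " " + " ".join(sorted(tokens2 - tokens1))).strip()
    return max(
        simple_ratio(common, combined1),
        simple_ratio(common, combined2),
        simple_ratio(combined1, combined2),
    )


print(token_set_sketch("Jingle Bells", " Bells Jingle Bells"))  # 100
```

Because sets drop duplicates, the repeated ‘Bells’ disappears before any comparison takes place, and a string whose tokens are all contained in the other scores 100.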

End Notes

When I first tried fuzzy matching, I pondered over the following two questions:

How is the Levenshtein Distance (LD) used to calculate the ratio?
What is this ratio?

After a little research I was able to unearth the formula.

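Ratio = (len(str1) + len(str2) - LD) / (len(str1) + len(str2))

Note that LD here counts a substitution as two edits (a removal plus an addition), so for pairs that need substitutions it is larger than the classic distance described earlier.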

Exercise for you : Try to embed this formula into my Levenshtein Notebook
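One possible way to approach the exercise is sketched below. It assumes the variant of the distance in which only additions and removals are allowed, so that a substitution costs 2; the names `indel_distance` and `ratio` are my own and are not taken from the notebook.

```python
def indel_distance(a: str, b: str) -> int:
    """Edit distance using only additions and removals (a substitution costs 2)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                current.append(previous[j - 1])  # matching letters cost nothing
            else:
                current.append(1 + min(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def ratio(a: str, b: str) -> int:
    total = len(a) + len(b)
    if total == 0:
        return 100  # two empty strings are identical
    return round(100 * (total - indel_distance(a, b)) / total)


print(ratio("My name is Sreemanta", "My name is Sreemanta Kesh"))  # 89
print(ratio("My name is Sreemanta", "My name is Sreemanta "))      # 98
```

Both results agree with the fuzz.ratio() outputs shown in the examples section.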

There are more functions inside the fuzzywuzzy package, such as WRatio() and UWRatio(). I urge you to explore them.

Personally, I have used fuzzy matching in my organization, where the aim was to map ‘Vendor Names’ to their ‘Invoice Reference’ and check for duplicates by setting the fuzz.ratio() threshold to 96. This methodology has various other important applications across industries, such as the spell checks mentioned at the beginning, CRM applications, bioinformatics, etc.

Do comment your thoughts below.

5 data formatting tasks from my last Data Science Project

master — Tue, 08 Mar 2022 08:45:47 +0000

Whether you’re a Data Analyst, Data Engineer or Data Scientist, there’s just no running away from data cleaning. It’s the essential skill you need if you want to make a career in data. I am writing this blog to share 5 unique data formatting challenges that I faced in my last project for a client.

Tip: If you are preparing for your Python interview, this might be good practice.

Creating tables from nested JSON files.
Extracting the month name directly from the month number.
Masking the confidential columns in your data.
Concatenating Excel sheets with sheet names as an alias for identification.
Formatting a complex list into a required dataframe.

1. Creating tables from nested JSON files

This is the sample JSON object that we need to start with.
My first impression while staring at this problem was that there must be a method in pandas that can read JSON files directly, and thank god that assumption came true. But let’s analyze this JSON a little further, as it appears to be nested.
After a little exploration I could understand the structure a little better.

Next, let us hop on to a Jupyter notebook and directly fire a query that reads the JSON file.
import pandas as pd

df = pd.read_json('test.json')
df.head(3)
Executing the above code produces the output below.

Clearly, we can see there’s a clean table barring the 1st column. And if you observe closely, you will also find a mapping between the 1st column and the rest of the table [emp_id = 1 & Id = 1]. At this point, I was convinced that these might be stored as two tables: one serving as a master for the employees and the other (the 1st column) holding detailed information about them.

# only working on the 1st column
d1 = pd.read_json('test.json').iloc[:, 0]

# creating an empty dataframe
df = pd.DataFrame()

# looping through each sequence and concatenating them
for i in d1:
    df1 = pd.DataFrame(i)
    df = pd.concat([df, df1])
df

The above code, upon execution, shows a nice clean table, which goes as follows.

That completes the first table, which was contained in the 1st column. The code for extracting the master employee table is below.

# Skipping the first column and saving the rest of the table
d2 = pd.read_json('test.json').iloc[:, 1:]

For beginners, I would like to point out that this is just one small task, and ideally, for specific tasks of this kind, we should develop the habit of packing our code into functions. Below is a small example of how to piece everything together while avoiding hard coding.

This might be less readable. The way to get here is to write simple one-liners first and then bundle them together in a function like this.
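The function from the original post is not reproduced here, so the following is only a sketch of what such a function could look like. The name `split_nested_json`, the `nested_col` parameter and the sample records are all made up for illustration; the sample merely imitates the structure described above (one nested column plus flat employee columns).

```python
from io import StringIO

import pandas as pd


def split_nested_json(source, nested_col):
    """Return (detail_table, master_table) from a JSON source with one nested column."""
    raw = pd.read_json(source)
    # Each cell of the nested column holds a list of records
    detail = pd.concat(
        [pd.DataFrame(cell) for cell in raw[nested_col]], ignore_index=True
    )
    master = raw.drop(columns=[nested_col])
    return detail, master


# Made-up sample shaped like the structure described above
sample = StringIO(
    '[{"details": [{"emp_id": 1, "skill": "SQL"}, {"emp_id": 1, "skill": "Python"}],'
    ' "Id": 1, "Name": "A"},'
    ' {"details": [{"emp_id": 2, "skill": "Excel"}], "Id": 2, "Name": "B"}]'
)
detail, master = split_nested_json(sample, nested_col="details")
print(detail)
print(master)
```

Passing the file path and the column name as arguments keeps the hard-coded values out of the function body; in the real task `source` would be the path to the JSON file.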
2. Extracting the month name directly from month number.

This is the task.
This is the easiest of all five tasks, but only if you know the right library.

import calendar

df['Month_Name'] = df['Month'].apply(lambda x: calendar.month_abbr[int(x)])

The above code is enough if your ‘Month’ column does not contain missing values, which was not the case for me. So I had to extend the code to handle that. Below is the complete version.

import numpy as np

df['Month_Name'] = df['Month'].fillna(0).apply(lambda x: calendar.month_abbr[int(x)] if x != 0 else np.nan)

calendar.month_abbr[int(x)] throws an exception when it encounters a missing value, because int() cannot convert NaN. Hence we need to impute the value first and then use an else condition to restore it to NaN.
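The same logic can also be written as a small named function, which some readers may find easier to follow than the lambda. This is a sketch using only the standard library; `month_abbr_or_none` is a hypothetical name, and the abbreviations shown assume the default English locale.

```python
import calendar
import math


def month_abbr_or_none(month):
    """Return 'Jan'..'Dec' for 1..12, or None for a missing value."""
    if month is None or (isinstance(month, float) and math.isnan(month)):
        return None
    return calendar.month_abbr[int(month)]


print([month_abbr_or_none(m) for m in [1, 7.0, float("nan"), 12]])
# ['Jan', 'Jul', None, 'Dec']
```

With pandas it could be applied as `df['Month'].apply(month_abbr_or_none)`, which avoids the temporary 0 placeholder.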

3. Masking the confidential columns in your data.

Before jumping to the code, I want to talk about the reason behind masking. When you work with consulting firms, chances are you will have to deal with third parties who may or may not have signed an NDA. For people hearing of NDAs for the first time: Non-Disclosure Agreements are an important legal framework used to protect sensitive and confidential information from being disclosed by the recipient of that information.
I was in a situation where we were dealing with sample data that had to be sent to a third party.
Now let’s jump on to the code. Here’s the sample data.

Find the data at: Data
Chances are you have never heard of the package called ‘cape-privacy’.
So I would recommend you install it right away if you wish to replicate the exercise at your end.

# pip install cape-privacy

This library has an array of methods. It generally aims at providing data privacy and data perturbation. I will link some good resources at the end of this blog where you can explore more about this package.
We want to import the transformations module from this package.

from cape_privacy.pandas import transformations as tfms

This module has a lot of transformations available. We can check them by firing the query below.

tfms??

This will take you to the below output.

You can check the documentation of each of them in detail by querying the specific method. I’m going to use Tokenizer; let’s read and understand its documentation, as we are using it for the very first time.

This is just a part of the documentation. When you execute this you will see details of the attributes and examples associated with the method.
OK. So the example tells us that we need to specify a parameter called ‘max_token_len’, which tells Python the length of the masked data. The other parameter, ‘secret’, allows the masked data to be reversed to the original data when the secret key is known, which we really don’t need as of now.
And here goes the final execution.

tokenize_name = tfms.Tokenizer(max_token_len = 20)
df['Name'] = tokenize_name(df.Name)
df

And we have our masked data.

Phew! Now I can safely ship the data to the third party.
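To get a feel for what tokenizing a column means, here is a stand-in written with the standard library's `hashlib`. It is not cape-privacy's implementation: the function `mask_value`, its `key` parameter and the sample names are all assumptions made for this illustration.

```python
import hashlib


def mask_value(value: str, max_token_len: int = 20, key: str = "demo-key") -> str:
    """Map a string to a fixed-length hexadecimal token."""
    digest = hashlib.sha256((key + value).encode("utf-8")).hexdigest()
    return digest[:max_token_len]


names = ["Alice", "Bob", "Alice"]  # made-up sample values
tokens = [mask_value(name) for name in names]
print(len(tokens[0]))          # 20
print(tokens[0] == tokens[2])  # True: the same name always maps to the same token
```

Because equal inputs give equal tokens, counts and joins on the masked column still work, while the original names are no longer readable.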

4. Concatenating excel sheets with sheet names as alias for identification.

If the heading does not make sense to you then just ignore it and look at the image down below.

We have data containing multiple sheets and we need to combine all of them together, with the sheet name as a new column for identification.
In the actual scenario I had more than 30 sheets, and hardcoding the sheet names was simply not an option. I have been using pandas for 4+ years, yet I was never aware of a method that directly fetches the sheet names.
This can be achieved with a simple two-liner.

path = 'Football_Data.xlsx'
sheetnames = pd.ExcelFile(path).sheet_names
sheetnames

Output : ['Premier League', 'Serie A', 'La Liga']
Leveraging this output, we are going to run a loop and add the sheets one after another using pandas.concat, as we saw in the 1st task of this blog.

df = pd.DataFrame()
for name in sheetnames:
    # read one sheet and tag its rows with the sheet name
    df1 = pd.read_excel(path, sheet_name=name)
    df1['Type'] = name
    df = pd.concat([df, df1])
df.reset_index(drop=True, inplace=True)

The resultant dataframe.

I never thought it would be as simple as this.
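As a side note, `pd.read_excel(path, sheet_name=None)` returns a dictionary of dataframes keyed by sheet name, so the same result can be sketched without listing the sheets first. The helper name `combine_sheets` and the tiny in-memory frames below are made up so that the example runs without an Excel file.

```python
import pandas as pd


def combine_sheets(sheets):
    """Stack a {sheet_name: DataFrame} mapping, tagging each row with its sheet name."""
    frames = [frame.assign(Type=name) for name, frame in sheets.items()]
    return pd.concat(frames, ignore_index=True)


# Made-up stand-in for pd.read_excel(path, sheet_name=None)
sheets = {
    "Premier League": pd.DataFrame({"Team": ["Team A", "Team B"]}),
    "Serie A": pd.DataFrame({"Team": ["Team C"]}),
}
combined = combine_sheets(sheets)
print(combined)
```

Collecting the frames in a list and calling `pd.concat` once also avoids growing a dataframe inside the loop.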

5. Formatting a complex list to a required dataframe.

The question goes something like this

At this point you can pause reading and take up this challenge. I was interviewing someone for a senior Data Engineer role and gave this problem to test the candidate’s Python skills. If you are able to solve it in 15 minutes (yes, you can google), you can consider yourself to be in a very good place, and I would rate you 4.5/5 in Python.
The way I approach these kinds of problems is to start from an absolute baseline by blindly firing the query below.

#sliced for the first shift by looking at the target dataframe
pd.DataFrame(availability[0:7])

We are nowhere near the output dataframe, but one realization came to my mind: we might have to go for a transposed version of this dataframe.

Now compare this intermediate dataframe to the final dataframe. You will realize that the only thing required from this data is the last row, and its index needs to be renamed from ‘value’ to ‘shift 1’.

So, at this point, I need to:

Get rid of the rest of the rows.
Rename ‘value’ to ‘shift 1’ before transposing.
Get rid of the shift and day values.

Let’s do that and see the results

pd.DataFrame(availability[0:7]).rename(columns={'value': 'shift' + '1'}).iloc[:, 2:].T

Now, we are somewhat close to our resultant dataframe. Next two tasks are

Convert the 0,1,2… to Monday, Tuesday, Wednesday..
Run a loop for all the shifts by slicing every 7 rows from the starting dataframe.

Step 1 goes as follows

import calendar

df = pd.DataFrame(availability[0:7]).rename(columns={'value': 'shift' + '1'}).iloc[:, 2:].T
df.columns = list(map(lambda x: calendar.day_name[int(x)], df.columns))
df

Does the above code ring any bells? 🙂

Now, putting all of it together:

df = pd.DataFrame()
for i in range(0, len(availability), 7):
    df = pd.concat([df, pd.DataFrame(availability[i:i+7]).iloc[:, 2:].T])

df.index = list(map(lambda x: 'Shift ' + str(x + 1), df.reset_index(drop=True).index.values))
df.columns = list(map(lambda x: calendar.day_name[int(x)], df.columns))
df

Resultant DataFrame

Conclusion

These tasks might not appear so difficult, as we have libraries in place that do the job for us, but if you are a beginner or at an intermediate level in Python, chances are you have never used some of them before.

Thanks for reading.

Just Enough Data Science

master — Sun, 06 Mar 2022 12:41:37 +0000

Learning Timeline

The first question I am always asked is how much time it takes. I’ll answer it in two ways.

Assuming you’re a working professional: anything between 10–12 hrs a week, sustained for 18–20 continuous weeks, should be enough. Remember that consistency is the key; remaining consistent can be a big challenge, and this is where most learners break.
Assuming you’re a full-time learner: in this case, you don’t have the burden of working 6–8 hrs a day and can devote more time to learning. If you can manage 20+ hrs a week for 10–12 weeks, you should be ready, and if you are choosing a training program, select one that fits this timeline.

Data Science skillset

This is where there is a big information gap for learners. I’ve seen that 90% of data science courses focus on only three things, i.e. Python, statistics and algorithms (including ML/DL). While these are certainly the essentials of being a data scientist, we need more. Once you are hired, your role actually fluctuates. For outsiders, the tag “Data Scientist” is misleading, and people sometimes unfairly assume that it’s only about predictive models. You will find me writing SQL queries, working in Excel, building dashboards, and at times even documenting my work in PowerPoint slides.

Here’s the list of skills you should acquire before applying for a Data Science job:

1. Python
2. Stats & ML
3. One data visualization tool: Power BI/Tableau
4. SQL
5. Exposure to at least one cloud platform from AWS/Azure/GCP
6. Python web frameworks like Flask/Django (good to have)

One thing I’ve learned working in the industry is that Data Science is just a process in which you use data to add value to the business. Sticking just to algorithms is sometimes not enough. Additionally, the analysis you do will not sit on your local machine; it has to be deployed somewhere, so explore #5 and #6 and learn how to deploy/productionize your ML models.

Why have I not mentioned R? R is cool but Python is now lightyears ahead of R. Also, out of every 10 jobs, you’ll find 8 of them would require Python. Python is a scripting language that goes beyond just Data Science. I cannot say the same for R. I recommend learning just one out of R and Python and hence the latter would be a wise pick.

Application

The best way to learn something is to implement it. Don’t limit your learning to the exercises and assignments in online/offline classes. Make sure you build a full-fledged project that shows the recruiter your ability to work as a data scientist. I recommend building an end-to-end project; I will recommend one in the resource section. Avoid overused projects like the Titanic dataset, house price prediction, etc.

When you work on a project always keep these three points in your mind:

1. What is the objective of the project?
2. What was your role in the project?
3. What impact will your project make to the business/clients/anybody that you are working for?

As a recruiter, I always ask what challenges the candidate faced and how they overcame them.

Sample Project: https://www.youtube.com/watch?v=MpF9HENQjDo&list=PL2zq7klxX5ASFejJj80ob9ZAnBHdz5O1t

Tip: If you are watching the tutorial just try emulating the steps with a different dataset and also make sure to give credit to the actual creator. Avoid Plagiarism.

How to combat the inexperience

This is by far the biggest challenge that people face. The harsh reality is that very few companies hire freshers. The easiest route today is to get an internship at a firm and work your way to a full-time job. Managers will prefer in-house interns over external inexperienced candidates. That’s how I see people getting hired in my firm and others.

But what to do if you’re already working in a non-data science or non-IT job with 5+ years of experience or even less?

All those years of experience will be irrelevant when you go for a Data Science job. So before jumping ships please ask yourself the following two questions.

1. Why do you want to be a data scientist?
2. What is your expectation in terms of pay if you land a full-time job?

Switching from one domain to another has many disadvantages, and you should have good reasons for it. I was teaching a candidate who had 5 years of experience as a Java developer and was already earning around 10 LPA, which is very good by Indian standards. When you learn a new skill you are starting fresh, so don’t expect high numbers.

I’m sure you can relate it to other professions. If you are a footballer and you want to play tennis you should not expect right away to play in the same field as the grand slam winners or earn like them. Data Science is no different.

That being said, you can still make it to Data Science. I hear success stories every day of people transitioning from non-IT fields to Analytics and Machine Learning. I am one of them. I was a Civil Engineer with no coding experience, and here I am working as a Data Scientist for one of the biggest consulting firms in the world.

So the main question is: if I’m a civil engineer with 5+ years of work experience, how do I utilize that experience?

Step 1 — Fabricate your experience. Split it into 3+2. Sell a story like this. “I worked as full time … for the past three years and then I was heavily involved in analytics of raw materials/reinforcements/topography etc”

Step 2 — Back it up with working with some relevant projects. Here are few examples.
1. If you work in transport/roadways, a project in which you try to predict vehicle density based on past data would be a nice one.
2. If you are involved in structural design, you could predict the compressive strength of concrete based on features like cement, slag, ash, water, fine aggregates, etc.

I can go on with more examples but I hope you get the point. And for people belonging to any other industry, the aim is to show a project relevant to your current line of profession which can be used to make your current process more effective by increasing revenue or limiting losses, or by saving time.

Step 3 — Reach out to someone who is actively working in the field maybe someone like me and ask them to interview you. Ask them to grill with questions regarding the projects mentioned in step 2.

Tip — Data cleaning is an integral aspect of Data Science. Talk to any data scientist they will tell you that 70–80% of their time goes into structuring their data. You can always express your love and desire to clean messy data.
This is something that does not need any experience and works well for beginners and experts and will certainly be appreciated by your interviewer.

Building your own brand

We’re in 2021 and every individual or business needs a professional presence. You should show your presence as well.

LinkedIn is a great way to help launch your career. It’s cheap and effective — if you have a strategy. Over the course of the last 3 years, I have grown my own brand a lot on LinkedIn and people reach out to me to check if I’m looking for a change or interested in taking up their job or be a part of their ventures.

This week my LinkedIn profile exploded and I also received job invites from people from my Network.

I’m at a place where I’m confident enough that if I make myself available to my network I will be getting a call from someone in my network. This is the modern way of job hunting.

The aim is to make people recognize you by your work. Engage with the community. Send connection requests to Data Scientists who are actively working in the field. Be polite and don’t randomly approach them to help with your questions or ask for a referral.

Note: Building a good LinkedIn profile takes time it does not happen overnight so consistency is the key.

I also recommend all my mentees write blogs as they learn. Writing is the best way to validate your learning. It is also a trick to draw your recruiter to your points of strength. Wonder how? Let’s say you have written 5 blogs on ensemble learning. Chances are you will be asked questions on that topic. Projects and blogs are the first level of cross-questioning you can expect from a recruiter; they want to test whether you are confident about your own work, and you should be able to answer, since you know the topics you blog about.

Lastly, do keep a track of all your exercises and projects in a GitHub repository. It acts like proof that you’ve actually worked on the project that you are talking about. Please note learning will be gradual and hence uploads in GitHub should also be timed.

LinkedIn, Blogpost, and GitHub should be the 1st three things that go in your resume.

Roadmap I recommend

Get yourself enrolled in a course. If possible offline. They say if you’re climbing a mountain, it’s better to do it with someone who has already done it. Keep in mind the following things before choosing a course.

1. Trainer’s profile: Always learn from someone who has industry experience.
2. Course duration: Maximum 6 months.
3. Cost: Make sure you are not burning a hole in your pocket.
4. Syllabus: Get the contents reviewed by someone who’s working in the field.

I learned data science while working full time. I had opted for a course that at the time charged me 2.25 lacs INR (3025 USD). Back in the day, resources and tutorials were very limited, so I don’t regret the decision. But since I had invested big, I was determined to work hard and make something out of it, so I was able to remain consistent.

It is too much money for someone who is learning data science. I would recommend not investing anything more than 30k-35k INR (500–550 USD) in any course.

Hope reading this blog was worth your time.

Good Luck on your journey to Data Scientist!