Understanding how to find and parse data

Over the past few days, several people I am acquainted with have asked me about getting hold of data from several sources and aggregating it in a central location. I have a varied background in development and IT, and this is really something that should be handled on a case-by-case basis, but I am going to craft a use case that I think will get the concepts across so you can start work on your specific problem.

Data Mining, enter the E.T.L.

I spent a few years working with a very, very large telecommunications company (you can probably guess who), and they collect a shitload of data: network data, phone call data, internet activity, website activity, and much more. Collecting this data can be a hell of a task; data formats are all over the place, and so are the connection protocols. This is where an E.T.L. comes into play. It stands for Extract, Transform, and Load. You extract data from a target, transform it into a format that suits your needs, and load it into your own database. This sounds simple in theory, and in some cases it is. Other times it becomes a work of art, as you have to learn how to reverse engineer a system just to get the data you need.
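To make the three steps a bit more concrete, here is a minimal sketch of an E.T.L. in Python 3 (an assumption on my part; adjust for your version). Everything in it is made up for illustration: the RAW_CALLS sample data, the calls table, and the function names are hypothetical, and a real source would be a file, an API, or a web page rather than a string.

```python
import csv
import io
import sqlite3

# Hypothetical raw export, invented for this example.
RAW_CALLS = """caller,duration,started
 555-0100 ,62,2014-03-01 09:15
555-0101,n/a,2014-03-01 09:20
555-0102,180,2014-03-01 09:31
"""


def extract(raw_text):
    """Extract: read rows out of the source format (CSV here)."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Transform: clean up values and drop rows we cannot use."""
    cleaned = []
    for row in rows:
        if not row["duration"].isdigit():
            continue  # skip rows whose duration is not a whole number
        cleaned.append(
            (row["caller"].strip(), int(row["duration"]), row["started"])
        )
    return cleaned


def load(records, connection):
    """Load: write the cleaned records into our own database."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS calls "
        "(caller TEXT, duration_seconds INTEGER, started TEXT)"
    )
    connection.executemany("INSERT INTO calls VALUES (?, ?, ?)", records)
    connection.commit()


def run_etl(raw_text, connection):
    """Chain the three steps together."""
    load(transform(extract(raw_text)), connection)
```

You could try it with an in-memory SQLite database: open one with sqlite3.connect(":memory:"), pass it to run_etl along with RAW_CALLS, and then query the calls table. Keeping the three steps in separate functions means you can swap out the extract step for a new source without touching the rest.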

A Simple Use Case

Realtor.com is a great example of a data-rich target with a structure that you have to navigate in order to get exactly what you are looking for. So let's start with a simple search, which will give you all the agents in the area of the 08052 zip code: http://www.realtor.com/realestateagents/08052

This is a clear, repeatable display. Now let's take a quick peek at the code behind it. I use a tool called Firebug; it's great if you do any web development (http://getfirebug.com/).

This will give you a box like this:

Now if you look, you can see <div class="agent-list-card clearfix" data-url="http://www.realtor.com/realestateagents/Keri–Ricci_CHERRY-HILL_NJ_4532_179994593">. Now look at the pattern: class="agent-list-card clearfix" is used for each of the agents in what they are calling "cards". Also, the data-url has the link to the agent's direct page.
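Here is a rough sketch of how you could pick that pattern out of a page with nothing but the Python 3 standard library. It assumes the markup looks like the snippet above (a div whose class includes agent-list-card and which carries a data-url attribute). The AgentCardParser class, the find_agent_urls function, and the SAMPLE_HTML with its example.com links are all made up for illustration, and it does not fetch anything from the live site.

```python
from html.parser import HTMLParser


class AgentCardParser(HTMLParser):
    """Collects the data-url of every div whose class includes agent-list-card."""

    def __init__(self):
        super().__init__()
        self.agent_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        attributes = dict(attrs)
        # An attribute with no value comes through as None, hence the "or".
        classes = (attributes.get("class") or "").split()
        if "agent-list-card" in classes and attributes.get("data-url"):
            self.agent_urls.append(attributes["data-url"])


def find_agent_urls(html_text):
    """Return the agent page links found in a chunk of HTML, in page order."""
    parser = AgentCardParser()
    parser.feed(html_text)
    parser.close()
    return parser.agent_urls


# Hypothetical markup that mimics the card pattern; not real listing data.
SAMPLE_HTML = """
<div class="agent-list-card clearfix" data-url="http://www.example.com/agents/jane-doe_1">
  <span>Jane Doe</span>
</div>
<div class="sidebar clearfix" data-url="http://www.example.com/ads/1"></div>
<div class="agent-list-card clearfix" data-url="http://www.example.com/agents/john-roe_2">
  <span>John Roe</span>
</div>
"""

for url in find_agent_urls(SAMPLE_HTML):
    print(url)
```

Matching on the class name rather than on the exact class string means the sidebar div is skipped even though it also has a data-url, and a card would still match if the site added another class to it.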

Now, with a little ingenuity you can build a little database…

Enter the Dragon (or Snake in this case)

Python is an extremely versatile and easy-to-use programming language. I am not going to cover the basics of Python here, but you can find a good tutorial at https://docs.python.org/2/tutorial/

 

To be continued…

What I am going to show you in part 2 is how to connect to a site and work your way to building a comma-separated list that we can eventually turn into a database.
