
Jamie Thomson

This is the blog of Jamie Thomson, a freelance data mangler in London

My first venture into Hadoop

If you keep an eye on the tech industry you cannot have failed to notice that an Open Source technology called Hadoop is getting lots of coverage these days. You’ll be able to find lots of descriptions around the web about what Hadoop actually is so I won’t try and come up with one myself, other than to say it’s a technology for storing and querying over large sets of data.

Microsoft recently released a preview of their Hadoop offering, which they are calling HDInsight; it can be installed on your own hardware (HDInsight Server) or you can sign up temporarily for their Azure-based offering (HDInsight Service). Learn more at http://www.microsoft.com/bigdata.

Cindy Gross has provided an excellent step-by-step tutorial on how to store data on HDInsight (either via the service or the server) and query it in her blog post Hurricane Sandy Mash-Up: Hive, SQL Server, PowerPivot & Power View. A few days ago I downloaded HDInsight Server and followed Cindy’s instructions for installing it, inserting data, querying it and then building a Power View report atop it, and the whole thing took less than two hours. Hence, if you’re looking to get familiar with Hadoop and you’re primarily a Windows guy like me then this is a really quick and easy (and free) way of going about it.

Cindy’s tutorial demonstrates how to combine data from Hadoop with data from SQL Server using PowerPivot and the output of that combined data is a Power View report:

[Screenshot: Power View report built from the combined Hadoop and SQL Server data]

My attempt is available on SkyDrive at http://sdrv.ms/Zhv4tB although be aware that Power View doesn’t work on SkyDrive so you’ll have to download the containing Excel workbook and open it in Excel 2013 if you want to see anything useful.


As I said above, the steps for installing HDInsight, sticking data in it, and querying it were very easy thanks to Cindy’s thorough blog post, and hence I was able to make some observations about this murky world of Hadoop that I’ve been hearing so much about.

Firstly, I gather that the “normal” way of querying Hadoop is to write programs in Java; however, a technology called Hive provides an abstraction over Hadoop that can make the data therein appear as a table with rows and columns, just as those of us immersed in the RDBMS world are used to. For example, the data provided by Cindy came as a text file:

[Screenshot: the raw tab-delimited census text file]

which literally gets stored in Hadoop as-is. Hive allows us to surface that file as a table using the following syntax:

CREATE EXTERNAL TABLE census (
  State_FIPS int
, County_FIPS int
, Population bigint
, Pop_Age_Over_69 bigint
, Total_Households bigint
, Median_Household_Income bigint
, KeyID string
)
COMMENT 'US Census Data'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/user/demo/census';

It’s fairly intuitive to look at this CREATE EXTERNAL TABLE definition and understand exactly what Hive is doing. It’s taking a text file whose “columns” are delimited by tabs and defining data types for those columns.
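
Once the table is defined you can query it with familiar SQL-like syntax. By way of a quick illustration (this particular query isn’t part of Cindy’s tutorial, and the number in the predicate is made up), something like the following would pick out the larger counties:

SELECT State_FIPS, County_FIPS, Population, Median_Household_Income
FROM census
WHERE Population > 1000000   -- structure is applied when the query runs, not when the file was stored
ORDER BY Population DESC
LIMIT 10;

Behind the scenes Hive compiles this into MapReduce jobs that read the text file in place; nothing ever gets “loaded” into a database in the traditional sense.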

There also exists an ODBC driver for Hive that enables one to query the data from anything that can talk ODBC (e.g. Excel).

I have often heard it said that Hadoop is for querying over unstructured data; however, to me, that text file has got lots of structure to it: it’s got column delimiters and row delimiters, and the data in each row is stored consistently. That observation led me to question the use of the word “unstructured” and I asked Cindy to clarify what that word meant in this context. She replied that it’s more a question of when structure is applied (in this case it’s being applied by Hive after the data is stored) and she correctly pointed out that a different structure could have been defined in the CREATE EXTERNAL TABLE command without modifying the underlying data.
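
To illustrate her point, here’s a hypothetical second definition (mine, not Cindy’s) that layers a different, looser structure over exactly the same file; nothing under /user/demo/census changes:

CREATE EXTERNAL TABLE census_raw (
  State_FIPS string
, County_FIPS string
, Population string
, Pop_Age_Over_69 string
, Total_Households string
, Median_Household_Income string
, KeyID string
)
COMMENT 'Same census file, every column surfaced as a string'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/user/demo/census';

Both tables can happily coexist because dropping or redefining an external table in Hive only discards the metadata, not the file it points at.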

If nothing else this tutorial helped to crystallise some of the marketing fluff that gets bandied about regarding Hadoop. Yes, you can probably store anything you like in there (hence “unstructured”), however in order to get any value from it you’re going to have to apply structure at some point, even if that’s as simple as breaking a document into all the individual words by specifying a whitespace delimiter.
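
To make that last point concrete, here’s a hedged sketch of what the word-splitting idea might look like in Hive (none of this is from Cindy’s tutorial; the folder /user/demo/docs and the table name docs are invented for illustration):

-- Surface every line of every file in the folder as a single string column
CREATE EXTERNAL TABLE docs (line string)
STORED AS TEXTFILE LOCATION '/user/demo/docs';

-- The structure is only applied here, at query time: split each line
-- on whitespace and count the resulting words
SELECT word, count(1) AS occurrences
FROM (SELECT explode(split(line, '\\s+')) AS word FROM docs) w
GROUP BY word
ORDER BY occurrences DESC;

The file went in untouched; each query decides for itself what the structure is.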

That’s a short rundown of what was an equally short investigation. If Hadoop is something that interests you then downloading HDInsight Server and following Cindy’s tutorial might be a good place to start.

@Jamiet

Published Wednesday, February 20, 2013 10:56 PM by jamiet

Comments

 

Emil Glowia said:

I'm not sure this particular set of data is unstructured. I prefer Inmon's definition of unstructured, which splits it into textual (emails, telephone conversations, PowerPoint presentations and more) and nontextual unstructured data (graphics and images, x-rays, diagrams etc.), which he covers in his book "DW 2.0: The Architecture for the Next Generation of Data Warehousing".

To me this data is just a source, and the process of creating a table is the same as creating a buffer in SSIS and then loading it to a destination table.

Take care

Emil

February 22, 2013 6:03 AM
 

Cindy Gross said:

It is very rare for data in and of itself to not have "any" structure. In the big data world unstructured, multi-structured, and semi-structured really refer to the fact that you load the data first and put (and possibly rapidly modify) the structure/metadata as you query it in different ways. It's more about the concept of the schema coming after the data creation/load and being flexible.

April 25, 2013 6:11 PM
 

