Learning SQL the Hard Way

By Rahul Agarwal, Senior Statistical Analyst at WalmartLabs



A Data Scientist who doesn’t know SQL isn’t worth his salt.

And that seems correct to me in every sense of the word. While we feel much more accomplished creating models and coming up with different hypotheses, the role of data munging can’t be overstated.

And with the ubiquity of SQL when it comes to ETL and data preparation tasks, everyone should know a little bit of it to at least be useful.

I still remember the first time I got my hands on SQL. It was the first language (if you can call it that) I learned. And it made an impact on me. I was able to automate things, and that was something I hadn’t thought of before.

Before SQL, I used to work with Excel: VLOOKUPs and pivots. I was creating reporting systems, doing the same work again and again. SQL made it all go away. Now I could write a big script, and everything would be automated, with all the crosstabs and analysis generated on the fly.

That’s the power of SQL. And though you could do anything that you do with SQL using Pandas, you still need to learn SQL to deal with systems like Hive, Teradata and sometimes Spark too.

This post is about installing SQL, explaining SQL and running SQL.


Setting Up the SQL Environment

Now the best way to learn SQL is to get your hands dirty with it (the same goes for anything else you want to learn).

I’d advise against using web-based tutorials like w3schools/tutorialspoint for SQL since you cannot use your own data with those.

Also, I’d advise you to learn the MySQL flavour of SQL as it is open source, easy to set up on your laptop and has a great client named MySQL Workbench to make your life easier.

Now that we have gotten these points out of the way, here is a step-by-step guide to get set up with MySQL:

  • You can download MySQL for your particular system (macOS, Linux, Windows) from Download MySQL Community Server. In my case, I downloaded the DMG Archive. After that, double-click and install the file. You might need to set up a password. Remember this password as it will be required to connect to the MySQL instance later.

  • Create a file named my.cnf and put the following in it. This is needed to give local file read permissions to your SQL database.
[client]
port= 3306
[mysqld]
port= 3306
secure_file_priv=''
local-infile=1
  • Open up System Preferences > MySQL. Go to Configuration and browse to the my.cnf file using the Select button.

  • Restart the server from the Instances tab by clicking Stop and Start.

  • Once you get that server running, download and install MySQL Workbench: Download MySQL Workbench. The workbench gives you an editor to write your SQL queries and get the results in a structured way.

  • Open up MySQL Workbench now and connect to SQL through it. You will see the Workbench home screen.

  • You can see that the Local Instance connection has been set up for you beforehand. Now, you just need to click on that connection and get started using the password that we set up before for the MySQL server. (You can also create a connection to an existing SQL server that might not be on your machine if you have the address, port number, username and password.)

  • And you get an editor to write your queries on the particular database.

  • Check the Schemas tab on the top left to see the tables that are present. There is just a sys schema present with the table sys_config. Not an interesting data source to learn SQL. So let’s install some data to practice.
  • If you have your own data to work with, well and good. You can create a new schema (database) and upload your data into tables. (You can run commands by using Cmd+Enter or by clicking the ⚡️ lightning button.)
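If you would like to try the "load your own data" idea without a MySQL server, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in. The table name my_films, its columns and the CSV rows are all made up for illustration. In MySQL itself you would typically use CREATE DATABASE, CREATE TABLE and LOAD DATA LOCAL INFILE instead.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content; in practice this would be a file on disk.
CSV_TEXT = """title,release_year,length
Alpha,2006,95
Beta,2006,48
Gamma,2005,120
"""


def load_csv(conn, csv_file):
    """Create the (made-up) my_films table and insert every CSV row into it."""
    conn.execute(
        "CREATE TABLE my_films (title TEXT, release_year INTEGER, length INTEGER)"
    )
    reader = csv.DictReader(csv_file)
    rows = [(r["title"], int(r["release_year"]), int(r["length"])) for r in reader]
    conn.executemany("INSERT INTO my_films VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # in-memory database, nothing touches disk
loaded = load_csv(conn, io.StringIO(CSV_TEXT))
row_count = conn.execute("SELECT COUNT(*) FROM my_films").fetchone()[0]
print(loaded, row_count)
```

Swapping io.StringIO(CSV_TEXT) for an opened file handle would load a real CSV the same way.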

In this tutorial, however, I am going to use the Sakila Movie database, which you can install using the following steps:

  • Go to MySQL Documentation and download the Sakila ZIP file.
  • Unzip the file.
  • Now go to MySQL Workbench and select File > Run SQL Script > select location sakila-db/sakila-schema.sql
  • Go to MySQL Workbench and select File > Run SQL Script > select location sakila-db/sakila-data.sql

Once you do that, you will see a new database added in the SCHEMA list.


Playing with Data

Now we have some data with us. Finally.

Let’s start with writing some queries.

You can try to understand the schema of the Sakila database in detail using the Sakila Sample Database document.


Schema Diagram


Now the basic syntax of any SQL query is:

SELECT col1, SUM(col2) as col2sum, AVG(col3) as col3avg FROM table_name WHERE col4 = 'some_value' GROUP BY col1 ORDER BY col2sum DESC;

There are five parts in this query:

  1. SELECT: Which columns to select? Here we choose col1 and do SUM aggregation on col2 and AVG aggregation on col3. We also give a new name to SUM(col2) by using the AS keyword. This is known as aliasing.
  2. FROM: From which table should we SELECT?
  3. WHERE: We can filter data using WHERE conditions.
  4. GROUP BY: All selected columns that are not in an aggregation should be in GROUP BY.
  5. ORDER BY: Sort on col2sum.

The above query will help you with most of the simple things you want to find in a database.

For example, we can find out how movies with different censor ratings differ in length using:

SELECT rating, AVG(length) AS length_avg FROM sakila.film GROUP BY rating ORDER BY length_avg DESC;
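To see what such a grouped query returns without installing anything, here is a small sketch that runs the same shape of query through Python's built-in sqlite3 module. The five film rows are invented for illustration and are not real Sakila data, and SQLite is only standing in for MySQL here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A tiny, made-up stand-in for the Sakila film table.
conn.execute("CREATE TABLE film (title TEXT, rating TEXT, length INTEGER)")
conn.executemany(
    "INSERT INTO film VALUES (?, ?, ?)",
    [("A", "PG", 100), ("B", "PG", 60), ("C", "R", 120), ("D", "G", 50), ("E", "R", 100)],
)

QUERY = """
SELECT rating, AVG(length) AS length_avg   -- aggregate and alias
FROM film                                  -- source table
GROUP BY rating                            -- one output row per rating
ORDER BY length_avg DESC                   -- longest average first
"""
rating_lengths = conn.execute(QUERY).fetchall()
for rating, length_avg in rating_lengths:
    print(rating, length_avg)
```

Each rating collapses into one row: R averages 110, PG averages 80 and G averages 50 for this toy data.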


Exercise: Frame a Question

You should now come up with some questions of your own.

For example, you can try to find out all the movies released in the year 2006. Or try to find all the movies which have a rating of PG and a length greater than 50 minutes.

You can do this by running the following on MySQL Workbench:

SELECT * FROM sakila.film WHERE release_year = 2006;
SELECT * FROM sakila.film WHERE length > 50 AND rating = 'PG';


Joins in SQL

Till now, we have learned how we can work with single tables. But in reality, we need to work with multiple tables.

So, the next thing we would want to learn is how to do joins.

Now joins are an integral and essential part of a MySQL database and understanding them is necessary. There are several kinds of joins in SQL. I usually end up using just the LEFT JOIN and INNER JOIN, so I will start with LEFT JOIN.

The LEFT JOIN is used when you want to keep all the records in the left table (A) and merge B on the matching records. For records of A that have no match in B, the columns coming from B are NULL in the resulting table. The MySQL syntax is:

SELECT A.col1, A.col2, B.col3, B.col4 FROM A LEFT JOIN B ON A.col2 = B.col3;

Here we select col1 and col2 from table A and col3 and col4 from table B. We also specify which common columns to join on using the ON clause.

The INNER JOIN is used when you want to merge A and B and keep only the common records in A and B.
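The difference between the two joins is easiest to see on a tiny example. The sketch below uses Python's built-in sqlite3 module (an assumption made for convenience, not MySQL) with two made-up tables A and B in which one row of A has no partner in B.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE A (col1 TEXT, col2 INTEGER);
CREATE TABLE B (col3 INTEGER, col4 TEXT);
INSERT INTO A VALUES ('a1', 1), ('a2', 2), ('a3', 3);
INSERT INTO B VALUES (1, 'b1'), (2, 'b2'), (4, 'b4');
""")

# LEFT JOIN keeps every row of A; unmatched rows get NULL (None) for B's columns.
left_rows = conn.execute(
    "SELECT A.col1, A.col2, B.col3, B.col4 "
    "FROM A LEFT JOIN B ON A.col2 = B.col3 ORDER BY A.col2"
).fetchall()

# INNER JOIN keeps only the rows that have a match on both sides.
inner_rows = conn.execute(
    "SELECT A.col1, A.col2, B.col3, B.col4 "
    "FROM A INNER JOIN B ON A.col2 = B.col3 ORDER BY A.col2"
).fetchall()

print(left_rows)
print(inner_rows)
```

The row ('a3', 3) survives the LEFT JOIN with None in place of B's columns, and it disappears from the INNER JOIN. The row (4, 'b4') of B shows up in neither result.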



To give you a use case, let’s go back to our Sakila database. Suppose we wanted to find out how many copies of each movie we have in our inventory. You can get that by using:

SELECT film_id, COUNT(film_id) AS num_copies FROM sakila.inventory GROUP BY film_id ORDER BY num_copies DESC;

Does this result look interesting? Not really. IDs don’t make sense to us humans, and if we can get the names of the movies, we would be able to process the information better. So we snoop around and see that the table film has got film_id as well as the title of the film.

So we have all the data, but how do we get it in a single view?

Joins come to the rescue. We need to add the title to our inventory table records. We can do this using:

SELECT A.*, B.title FROM sakila.inventory A LEFT JOIN sakila.film B ON A.film_id = B.film_id;

This will add another column to your inventory table records. As you might notice, some films are in the film table that we don’t have in the inventory. We used a left join since we wanted to keep whatever is in the inventory table and join it with its corresponding counterpart in the film table, and not everything in the film table.

So now we have got the title as another field in the data. This is just what we wanted, but we haven’t solved the whole puzzle yet. We want the title and num_copies of the title in the inventory.

But before we can go any further, we should understand the concept of inner queries first.


Inner Query:

Now you have a query that can give you the above result. One thing you can do is create a new table using:

CREATE TABLE sakila.temp_table AS SELECT A.*, B.title FROM sakila.inventory A LEFT JOIN sakila.film B ON A.film_id = B.film_id;

And then use a simple group by operation using:

SELECT title, COUNT(title) AS num_copies FROM sakila.temp_table GROUP BY title ORDER BY num_copies DESC;

But this is one step too many. And we have to create a temporary table that ends up taking space on the system.

SQL provides us with the concept of the inner query just for these kinds of issues. You can instead write all this in a single query using:

SELECT temp.title, COUNT(temp.title) AS num_copies FROM (SELECT A.*, B.title FROM sakila.inventory A LEFT JOIN sakila.film B ON A.film_id = B.film_id) temp GROUP BY title ORDER BY num_copies DESC;

What we did here was sandwich our first query in parentheses and give that table an alias temp. We then did the group by operations considering temp just as we would consider any table. It is because of the inner query concept that we can write SQL queries that span multiple pages at times.
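Here is a runnable sketch of the inner query idea, again using Python's built-in sqlite3 module as a stand-in for MySQL. The film and inventory tables below are tiny, made-up imitations of the Sakila ones (three films, five inventory rows), so the numbers are only illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE film (film_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE inventory (inventory_id INTEGER PRIMARY KEY, film_id INTEGER);
INSERT INTO film VALUES (1, 'ALPHA'), (2, 'BETA'), (3, 'GAMMA');
INSERT INTO inventory (film_id) VALUES (1), (1), (1), (2), (2);
""")

# The parenthesised SELECT is the inner query; "temp" is its alias and is
# then grouped and counted exactly as if it were a real table.
QUERY = """
SELECT temp.title, COUNT(temp.title) AS num_copies
FROM (
    SELECT A.*, B.title
    FROM inventory A
    LEFT JOIN film B ON A.film_id = B.film_id
) temp
GROUP BY temp.title
ORDER BY num_copies DESC
"""
copies = conn.execute(QUERY).fetchall()
print(copies)
```

No temporary table is created along the way. GAMMA does not appear in the result because it has no rows in the toy inventory table.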


The HAVING Clause

HAVING is yet another SQL construct that is useful to understand. So we have got the results, and now we want to get the films whose number of copies is less than or equal to 2.

We can do this by using the inner query concept and the WHERE clause. That means nesting one inner query inside another. Pretty neat.

Or, we can use the HAVING clause.

The HAVING clause is used to filter on the final aggregated result. It is different from WHERE, as WHERE is used to filter the table that is used in the FROM clause. HAVING filters the final result after the GROUP BY happens.
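The two approaches can be compared side by side. The sketch below is my own illustration rather than the exact queries from this post. It uses Python's built-in sqlite3 module and made-up film and inventory tables, and checks that the nested WHERE version and the HAVING version return the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE film (film_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE inventory (inventory_id INTEGER PRIMARY KEY, film_id INTEGER);
INSERT INTO film VALUES (1, 'ALPHA'), (2, 'BETA'), (3, 'GAMMA');
INSERT INTO inventory (film_id) VALUES (1), (1), (1), (2), (2), (3);
""")

# Option 1: wrap the aggregated inner query in another query and use WHERE.
NESTED_WHERE = """
SELECT title, num_copies
FROM (
    SELECT temp.title AS title, COUNT(temp.title) AS num_copies
    FROM (
        SELECT A.*, B.title
        FROM inventory A
        LEFT JOIN film B ON A.film_id = B.film_id
    ) temp
    GROUP BY temp.title
) counts
WHERE num_copies <= 2
ORDER BY title
"""

# Option 2: filter the aggregated result directly with HAVING.
WITH_HAVING = """
SELECT B.title, COUNT(B.title) AS num_copies
FROM inventory A
LEFT JOIN film B ON A.film_id = B.film_id
GROUP BY B.title
HAVING COUNT(B.title) <= 2
ORDER BY B.title
"""

nested_result = conn.execute(NESTED_WHERE).fetchall()
having_result = conn.execute(WITH_HAVING).fetchall()
print(nested_result)
print(having_result)
```

Both queries drop ALPHA (three copies) and keep BETA and GAMMA. The HAVING version gets there with one level of nesting less.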

There are many ways to do the same thing with SQL, as you have already seen in the above example. We need to try to come up with the least verbose one, and thus HAVING makes sense in many cases.

If you can follow this far, you already know more SQL than most people.

Next thing to do: Practice.

Try to come up with your own questions on your dataset and try to find the answers using SQL.

Some questions I could provide for a start:

  1. Which actor has the most distinct films in our inventory?
  2. Which genre of films is the most rented in our inventory?


Continue Learning

This was just a simple tutorial on how to use SQL. If you want to learn more about SQL, I would like to call out an excellent course on SQL for Data Science from the University of California. Do check it out as it talks about other SQL concepts like UNION, string manipulation, functions, date handling, etc.

I am going to be writing more beginner-friendly posts in the future too. Follow me on Medium or subscribe to my blog to be informed about them. As always, I welcome feedback and constructive criticism and can be reached on Twitter @mlwhiz.

Also, a small disclaimer: there might be some affiliate links in this post to relevant resources, as sharing knowledge is never a bad idea.

Bio: Rahul Agarwal is Senior Statistical Analyst at WalmartLabs. Follow him on Twitter @mlwhiz.

Original. Reposted with permission.

