Python vs. SQL | Pros and Cons
Fast-forward twenty years. Nowadays data comes in many shapes and formats, not necessarily synonymous with relational databases any longer. Data can be found within CSV files, plain text, on the web: unstructured, and in many other different ways and formats. This is where Python and its broad toolset of libraries, shines.
With a huge set of helper libraries and associated platforms, Python is a great programming language for quick, and iterative data exploration. Python’s set of libraries include everything from data visualization to statistical analysis, making it convenient for developers to jump into data analysis and begin identifying patterns.
What’s great about SQL is that it is not mutually exclusive to using Python, both can perfectly co-exist and be used together.
Use cases for SQL and Python
SQL is designed to query and extract data from tables within a database. SQL is really good at allowing you as a developer, to easily join (or merge) several data together. Combining data from multiple tables is a key strength.
However, SQL isn’t really designed for manipulating or transforming data into other formats. Higher level data manipulation that is widely common within data science, such as statistical analysis, regression tests, and time series data manipulation, which is very difficult to achieve using exclusively SQL.
Python, on the other hand, has a well-known data analysis Library, called Pandas, which has been specially designed for data analysis and manipulation.
It is particularly well suited for structured (tabular) data which can be fetched using SQL and then require farther manipulation, which might be difficult to achieve using SQL alone. This leads to the following question: So, when should you use Python vs. SQL?
Simply put, use SQL to retrieve the essential data required for the analysis, and then use specialized Python libraries to process it.
SQL and Python Differences
Say we have the following database table with hotel information called hotels. This is a small sample dataset of the information contained within the table. This table could contain thousands or millions of rows (entries), each one representing a hotel.
|1 Via dei Valeri||Rome||Via Dei Valeri 1||54|
|108 Minutes Hostel||Moscow||Malaya Ordynka Street 5/6 Bld.4 App.3||24|
|1st Arbat Hostel at Novinsky||Moscow||Novinsky Bulvar 1/2 App 64||26|
|A Plus Hostel||Prague||Na Florenci 14 13/33||14|
|Aalborg Hotel Amsterdam||Amsterdam||Sarphatipark 106||39|
106 – 1073
|Adagio Access Vanves Porte de Chatillon||Paris||5 rue Aristide Briand,||53|
|Advisor Hotel Hanoi||Hanoi||51 Nguyen Sieu Street Hoan Kiem||13|
|Afrian House Hotel||Cairo||15 Emad El Din Street||11|
|Agon Aldea Hotel Berlin||Berlin||Bulowstrasse 19-22||22|
|Agustina Serviced Apartments Quezon City||Manila||No. 9 18th St. Brgy Kalusugan (New Manila)||27|
Let’s see how we could query this dataset using SQL and also with Python using Pandas. This will allow us to draw some quick comparisons between both.
|Type of Action||SQL||Python using Pandas|
|Retrieve all the hotels||SELECT * FROM hotels||hotels|
|Retrieve the first 3 hotels||SELECT * FROM hotels LIMIT 3||hotels.head(3)|
|Get the names of hotels in Paris||SELECT Name FROM hotels WHERE City = ‘Paris‘||hotels[hotels.City == ‘Paris‘].Name|
As we can see, so far both are quite similar. However, things start to become more interesting when we need to do more complex queries.
Examples when Python is easier to use than SQL
Let’s say we did some preliminary data querying, and now have a dataset table called hotels_by_city, that contains the number of hotels per city, which looks as follows:
With this new dataset, let’s say we want to order results by Hotel _Count and only select the top ten cities with the largest count. We could do this as follows:
|SQL||Python using Pandas|
|SELECT City FROM hotels_by_city ORDER BY Hotel_Count DESC LIMIT 10||hotels_by_city.nlargest(10, columns=’Hotel_Count‘)|
You can start to see that the Pandas syntax is becoming easier to grasp and write, as the examples of the data being retrieved, become a bit more complicated.
Let’s add an additional layer of complexity, so we now need to determine using this new dataset table, which is the next ten hotels, after the top ten.
|SQL||Python using Pandas|
|SELECT City FROM hotels_by_city ORDER BY Hotel_Count DESC LIMIT 10 OFFSET 10||hotels_by_city.nlargest(20, columns=’Hotel_Count‘).tail(10)|
Notice how the Pandas syntax remains almost unaltered as complexity increases, whereas the SQL syntax becomes more complex to read.
Another way to think about differences between Python and SQL is that SQL allows you to start with a dataset which is structured into tables, from which you use Python, to branch off different analyses in various directions.
Types of SQL
SQL (by itself) is often referring to the language Structured Query Language.
MySQL (pronounced MY Sequel) is the most popular type of SQL database. The super-geek way to refer to this type of SQL database as a relational database management system (or RDBMS). More or less that just means it’s a database with many tables that link together in some way. All WordPress installations by default are using MySQL.
PostgreSQL (often just referred to as “Postgres”) is similar to MySQL, but is known to be more durable. With that said, setting up Postgres for newbies can often be difficult, so if you don’t know where to MySQL is probably you’re best bet.
SQLite (pronounced Sequel Lite) is a very portable, and compact database. SQLite is usually best used for testing and very, very simple applications.
When to choose Python vs. SQL?
SQL contains a much simpler and narrow set of commands compared to Python. In SQL, queries almost exclusively use some combination of JOINS, aggregate functions, and subqueries functions.
Python, by contrast, is like a collection of specialized Lego sets, each with its own specific purpose. Each library has it’s own custom pieces for building something very specific: Pandas for data analysis, PyPDF2 for PDF manipulation, Scikit-learn for machine learning, and much more.
Which language to learn first Python or SQL?
From a learning path perspective, start off with SQL: it’s an essential tool to do any kind of data retrieval from relational databases, even if you’re primary job has little or nothing to do with data analysis.
Even if the SQL query is ten times longer then the equivalent Python script, because it reads like English, it feels easier to do then doing the equivalent in Python when getting started. Remember, learning is harder than typing, and takes more time.
Once you have a reasonable knowledge of SQL, that you can at least write a query that joins two tables together, attempt to rewrite some of that same logic in Python using Pandas.
|Founded||Created in 1991 by Guido Van Rossum||Created by Donald D. Chamberlin and Raymond F. Boyce in 1974|
|Pros||A versatile, dynamic, easy, broad scope||A very easy learning curve, English-like syntax.|
|Cons||Steeper learning curve then SQL||Mostly applicable only in relational databases (with a few exceptions)|
|Web Libraries||A library for almost anything||None, SQL is just SQL.|
|Types||Python 2 and Python 3 are the two most common types of Python||MySQL, SQL Server, PostgreSQL, SQLite|
Almost every high-tech company.
Furthermore, Python is widely used in many organizations for data-science and exploration.
|Nearly every website on the internet uses a SQL database for its back-end.|
Thanks to Eduardo Freitas for his help co-writing and proofing this piece!