Python vs. SQL | Pros and Cons
Approximately twenty years ago, there were only a handful of programming languages that a software engineer would need to know well. Even back then, Structured Query Language, or SQL, was the go-to language when you needed to gain quick insight on some data, fetch records, and then draw preliminary conclusions that might, eventually, lead to a report or to writing an application.
Nowadays, data comes in many shapes and formats, not necessarily synonymous with relational databases any longer. Data exists within CSV files, plain text, and on the web, as well as in many other different formats. It is in this maze of data that Python’s broad toolset of libraries shine.
With a vast set of helper libraries and associated platforms, Python is an excellent programming language for quick, and iterative data exploration. Python’s set of libraries include everything from data visualization to statistical analysis, making it convenient for developers to jump into data analysis and begin identifying patterns.
Use cases for SQL and Python
SQL is designed to query and extract data from tables within a database. SQL is good at allowing you as a developer, to seamlessly join (or merge) several data together. Combining data from multiple tables is a key strength.
However, SQL isn’t designed for manipulating or transforming data into other formats. Types of higher-level data manipulation that are common within data science, such as statistical analysis, regression tests, and time-series data manipulation, are is very difficult to achieve using SQL exclusively.
Python, on the other hand, has a well-known data analysis Library called Pandas, which has been specially designed for data analysis and manipulation.
Python is particularly well suited for structured (tabular) data which can be fetched using SQL and then require farther manipulation, which might be challenging to achieve using SQL alone. This leads to the following question: When should you use Python vs. SQL?
SQL and Python Differences
Use SQL to retrieve the essential data required for the analysis, and then use specialized Python libraries to process it.
Let’s take a look at an example using SQL:
Say we have the following database table with hotel information called hotels. This is a small sample dataset of the information contained within the table. This table could include thousands or millions of rows (entries), each one representing a hotel.
|1 Via dei Valeri||Rome||Via Dei Valeri 1||54|
|108 Minutes Hostel||Moscow||Malaya Ordynka Street 5/6 Bld.4 App.3||24|
|1st Arbat Hostel at Novinsky||Moscow||Novinsky Bulvar 1/2 App 64||26|
|A Plus Hostel||Prague||Na Florenci 14 13/33||14|
|Aalborg Hotel Amsterdam||Amsterdam||Sarphatipark 106||39|
106 – 1073
|Adagio Access Vanves Porte de Chatillon||Paris||5 rue Aristide Briand,||53|
|Advisor Hotel Hanoi||Hanoi||51 Nguyen Sieu Street Hoan Kiem||13|
|Afrian House Hotel||Cairo||15 Emad El Din Street||11|
|Agon Aldea Hotel Berlin||Berlin||Bulowstrasse 19-22||22|
|Agustina Serviced Apartments Quezon City||Manila||No. 9 18th St. Brgy Kalusugan (New Manila)||27|
Let’s compare how we could query this dataset using SQL versus Python using Pandas.
|Type of Action||SQL||Python using Pandas|
|Retrieve all the hotels||SELECT * FROM hotels||hotels|
|Retrieve the first 3 hotels||SELECT * FROM hotels LIMIT 3||hotels.head(3)|
|Get the names of hotels in Paris||SELECT Name FROM hotels WHERE City = ‘Paris‘||hotels[hotels.City == ‘Paris‘].Name|
As we can see, so far, both are quite similar. However, things start to become more interesting when we need to do more complex queries.
Examples when Python is easier to use than SQL
Let’s say we did some preliminary data querying, and now have a dataset table called hotels_by_city, that contains the number of hotels per city, which looks as follows:
With this new dataset, let’s say we want to order results by Hotel _Count and only select the top ten cities with the largest count. We could do this as follows:
|SQL||Python using Pandas|
|SELECT City FROM hotels_by_city ORDER BY Hotel_Count DESC LIMIT 10||hotels_by_city.nlargest(10, columns=’Hotel_Count‘)|
You can start to see that the Pandas syntax is becoming easier to grasp and write, as the examples of the data being retrieved, become a bit more complicated.
Let’s add a layer of complexity, so we now need to determine using this new dataset table, which is the next ten hotels, after the top ten.
|SQL||Python using Pandas|
|SELECT City FROM hotels_by_city ORDER BY Hotel_Count DESC LIMIT 10 OFFSET 10||hotels_by_city.nlargest(20, columns=’Hotel_Count‘).tail(10)|
Notice how the Pandas syntax remains almost unaltered as complexity increases, whereas the SQL syntax becomes more complex to read.
Another way to think about differences between Python and SQL is that SQL provides the setup and Python takes you from there. SQL allows you to start with a dataset which is structured into tables. Once you have the tables, you use Python to branch off different analyses in various directions.
Types of SQL
MySQL (pronounced MY Sequel) is the most popular type of SQL database. The super-geek way to refer to this type of SQL database as a relational database management system (or RDBMS). More or less, that means it’s a database with many tables that link together in some way. All WordPress installations by default are using MySQL.
PostgreSQL (often just referred to as “Postgres”) is similar to MySQL but is known to be more durable. However, setting up Postgres for newbies can often be difficult, so if you are new to this, MySQL is probably you’re best bet.
SQLite (pronounced Sequel Lite) is a very portable, and compact database. SQLite is usually best used for testing and very, very simple applications.
When to choose Python vs. SQL?
SQL contains a much simpler and narrow set of commands compared to Python. In SQL, queries almost exclusively use some combination of JOINS, aggregate functions, and subqueries functions.
Python, by contrast, is like a collection of specialized Lego sets, each with a specific purpose. Each library has custom pieces for building a particular type of a particular thing: Pandas for data analysis, PyPDF2 for PDF manipulation, Scikit-learn for machine learning, and much more.
Which language to learn first Python or SQL?
We think that the best place to start is by learning SQL. SQL is an essential tool for any kind of data retrieval from relational databases, even if you’re primary job has little or nothing to do with data analysis.
Even if the SQL query is ten times longer then the equivalent Python script, it feels easier to do then doing the equivalent in Python because it reads like English. Remember, learning is more laborious than typing, and takes more time.
Once you have a reasonable knowledge of SQL, that you can at least write a query that joins two tables together, attempt to rewrite some of that same logic in Python using Pandas.
|Founded||Created in 1991 by Guido Van Rossum||Created by Donald D. Chamberlin and Raymond F. Boyce in 1974|
|Pros||A versatile, dynamic, easy, broad scope||A very easy learning curve, English-like syntax.|
|Cons||Steeper learning curve then SQL||Mostly applicable only in relational databases (with a few exceptions)|
|Web Libraries||A library for almost anything||None, SQL is just SQL.|
|Types||Python 2 and Python 3 are the two most common types of Python||MySQL, SQL Server, PostgreSQL, SQLite|
Almost every high-tech company.
Furthermore, Python is widely used in many organizations for data-science and exploration.
|Nearly every website on the internet uses a SQL database for its back-end.|
Thanks to Eduardo Freitas for his help co-writing and proofing this piece!