Using Python as SQL with Pandas DataFrame

Ruoxinli
2 min readNov 28, 2020

SQL is very easy to use, but what if you need to achieve the same operations in Python? Here are the corresponding operations in Python using Pandas DataFrame(DF). All examples are written as in SQL Server.

Below including DISTINCT, WHERE, LIKE, IN/NOT IN, ORDER BY, GROUP BY, CASE WHEN, ROW_NUMBER(), JOIN, UNION ALL/UNION, COUNT, DROP COLUMN, and RENAME COLUMN.

  • Distinct. This operation drops the duplicates in the specified columns in the DF.
SQL:    select distinct col1, col2 from df
Python: df_dedup = df.drop_duplicates(subset= [‘col1’, ‘col2’])
  • Where
SQL:    select * from df where col1=value1 and col2=value2
Python: df[ (df[‘col1’]==’value1’) & (df[‘col2’]=='value2’)]
  • Like. Python is case sensitive, SQL is not case sensitive.
SQL:    select * from df where col like '%abc%'
Python: df[df[‘col’].str.contains(‘ABC’, na=False, regex = False)]
  • In/Not in
SQL: select * from df where col in (value1, value2)
select * from df where col not in (value1, value2)
Python:
in - df[df[‘col’].isin([‘value1’,’value2’])]
not in - df[~df[‘col’].isin([‘value1’,’value2’])]
  • Order by
SQL:    select * from df order by col1, col2 desc
Python: df.sort_values(['col1’,’col2’],ascending=[True,False])
  • Group by
SQL:    select avg(*) from df group by col
Python: df.groupby(‘col’).mean().reset_index()
  • Case When.
SQL: Select 
case when ConditionCol <= 550 then 'low'
when ConditionCol>550 and ConditionCol<720 then 'medium'
when ConditionCol>=720 then 'high' end as newCOl
from df
Python:
df['newCol'] = df['Col']
df.loc[df['ConditionCol']<=550, 'newCol'] = 'low'
df.loc[df['ConditionCol'].between(550,720, inclusive = False), 'newCol'] = 'medium'
df.loc[df['ConditionCol']>=720, 'newCol'] = 'high'
  • Window function row_number()
SQL: select row_number() over(partition by partitionCol order by orderCol) as newCol from dfPython: 
df[‘newCol’] = df.groupby(‘partitionCol’)[‘orderCol’].rank(method = ‘first’)
  • Join. pd.merge has four types in the argument ‘how’: left, right, inner, outer. Use the arguments left_on, right_on when the common columns are named differently.
SQL:
select * from pd1 full join pd2 on pd1.id = pd2.id
Python:
df_join = pd.merge(df1, df2, how = ‘outer’, on = ‘Id’)
  • Union all/ Union
SQL: 
select * from df1 union all select * from df2
select * from df1 union select * from df2
Python:
# union all
df_unionall = pd.concat([df1, df2]).reset_index(drop = True)
# union
df_union = pd.concat([df1, df2]).reset_index(drop = True).drop_duplicates()
  • Count. Python DF.value_counts will only return the selected column and the count of each value.
SQL     select col, count(*) from df group by col
Python: df[‘col’].value_counts().rename_axis(‘col’).reset_index(name = ‘counts’)
  • Drop column
SQL: alter table df
drop column col1, col2
Python:
ColtoDrop = [‘col1’,’col2’]
df.drop(columns=ColtoDrop)
  • Rename column
SQL: select col newName, col2, col3...from df
Python: df.rename(columns={‘col’:’newName’})
# rename returns a new df with the new columns name. Use parameter "inplace=True" to change the columns name inplace, this will return None.

Hope this helps and I will continue to add to this list.

--

--