A Data Engineer’s Playground: My Beginner’s Roadmap from Clueless to Slightly Less Clueless
Six months ago I officially started my journey into Data Engineering, and I have now secured my first offer as a Junior Data Engineer. It has been a hell of a journey, but as they say: what doesn’t kill you makes you stronger!
Over the months I have followed a roadmap that has brought me to where I am today, and in response to numerous inquiries I’ve received on LinkedIn on how to get started, I decided to write about it extensively. In this blog post, I will outline the steps, tutorials and courses I took to understand certain technologies and enhance my portfolio through personal Data Engineering projects.
For all you Time Jetsetters in need of a quick overview, here are the steps I’ve taken to become a Data Engineer in 2023:
- Understand the “what” and know the “why” (What is Data Engineering? Why do I want to become a Data Engineer?)
- Research. Find out trending tools in the industry
- Learn the basics (essentially Python, SQL, and Data Modelling)
- Learn fundamental ETL/Data Pipeline technologies
- Build personal projects and make a whole lot of noise about it
- Get that offer — interview preps, apply for internships/junior roles
If you are interested in the nitty-gritty, please keep reading!
Step 1: Understand the ‘what’ and know the ‘why’
I used to work as a Python Automation Engineer, which meant I automated tasks that involved moving data from one point to another, essentially tasks that would otherwise be done manually. Prior to this, I worked as a Data Scientist but without the ML part…lol. From these descriptions, you can tell I already performed a bit of Data Engineering at a very high level.
This gave me an idea of what a Data Engineer really does. Simply put, a Data Engineer is someone who builds and manages the systems and infrastructure for collecting, storing, and processing data, ensuring it’s readily available and accessible for analysis and decision-making. Now I understood “what” it was, but “why”?
For me it was pretty simple; I love working with Data, I have a thing for DevOps, and I admire the idea of cloud computing. I realised Data Engineering gives me the opportunity to combine these three components and of course, make some money while doing it :)
Step 2: Research… lots of it!
As of 2023, when this post was written, the technologies used in the Data Engineering space are quite different from what was used in 2010. But how do you know that? Through research! 💡
This allows you to discover trending technologies, their applications, and project concepts to construct a strong portfolio.
Now the easiest way to research trending technologies is by looking at various job descriptions, especially on LinkedIn and other job posting platforms. Here are some of the in-demand technologies I came across during my research: PySpark, NoSQL, SQL, Snowflake, BigQuery, Airflow, Databricks, Terraform, Kafka, AWS Lambda, Glue, Athena, EC2, S3, Azure Data Factory, dbt, Azure Blob Storage, etc. A lot, right? Well, knowing these guided me in figuring out what to learn and in creating projects that would utilize some of these tools.
Another valuable research avenue is following Data Engineering professionals and communities on social media. Engaging with their informative posts can give you inspiration for sought-after tools and Data Pipeline projects that you can undertake.
Now that you have an idea of the tools in demand, choose a cloud platform to start with and curate a checklist of these tools you will need to incorporate into your personal projects. I strongly recommend AWS primarily because of its expansive community, which offers numerous tutorials, courses, and well-documented resources.
Keep in mind that cloud platforms share similar underlying concepts, so gaining proficiency in one provides you with an advantage. However, if you wish to explore the workings of other cloud platforms, you can always venture into them once you’ve built a solid foundation in AWS.
Step 3: Learn the basics (Python, SQL, Data Modelling)
Some Data Engineers write Java or Scala, but Python is more common and widely used due to its simplicity and flexibility. If Python is your main language, the very core of your Data Engineering pipelines will be Python and SQL, because most of your interactions with various tools happen through Python APIs or structured query languages. Coming from a Python background, this wasn’t hard for me; I just needed some SQL brush-ups because mine was very rusty. Lowkey, I still feel it is :(
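To make the Python-plus-SQL pairing concrete, here is a minimal sketch of the interaction pattern. It uses the standard library’s in-memory SQLite database as a stand-in for a real warehouse; the `orders` table and its data are made up for illustration, but the pattern is the same with a Postgres or Snowflake connector.

```python
import sqlite3

# In-memory SQLite database as a stand-in for a real warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")

# Parameterized inserts: Python hands the data over, SQL stores it.
rows = [(1, "alice", 120.0), (2, "bob", 80.5), (3, "alice", 42.25)]
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)

# A typical analytical query: total spend per customer.
cursor = conn.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM orders GROUP BY customer ORDER BY total DESC"
)
for customer, total in cursor:
    print(customer, total)  # → alice 162.25, then bob 80.5
```

The division of labour shown here, Python for orchestration and data movement, SQL for set-based querying, is what you will repeat with almost every tool on the list above.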
It is also very important to understand how to model data for Business Intelligence purposes. Getting the hang of different database structures, sketching out how data entities relate, and picking the right databases for tasks like real-time processing (OLTP) or data analysis (OLAP) is crucial stuff. I took this lengthy but insightful course on Udemy to dive into the world of Analytics Engineering. It also had a hands-on section on DBT and Bigquery which was a check off my list :)
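To illustrate the modelling side, here is a tiny star schema sketched in SQLite: one fact table of sales pointing at one product dimension. The table and column names are invented for this example, but the shape (facts join to dimensions, aggregates roll up by dimension attributes) is the core OLAP idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Dimension table: descriptive attributes you slice and filter by.
conn.execute(
    "CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT)"
)
# Fact table: one row per event (a sale), referencing the dimension.
conn.execute(
    "CREATE TABLE fact_sales (sale_id INTEGER PRIMARY KEY, product_id INTEGER, quantity INTEGER, revenue REAL)"
)

conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                 [(1, "Laptop", "Electronics"), (2, "Desk", "Furniture")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                 [(10, 1, 2, 2400.0), (11, 1, 1, 1150.0), (12, 2, 3, 900.0)])

# An OLAP-style question: revenue per category, via a fact-to-dimension join.
query = """
SELECT p.category, SUM(f.revenue) AS revenue
FROM fact_sales f
JOIN dim_product p USING (product_id)
GROUP BY p.category
ORDER BY revenue DESC
"""
for category, revenue in conn.execute(query):
    print(category, revenue)  # → Electronics 3550.0, then Furniture 900.0
```

An OLTP system would keep this data normalized for fast single-row writes; the star shape above trades that for fast, readable analytical queries, which is exactly the distinction the course covers.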
Step 4: Learn fundamental ETL/Data Pipeline technologies
This is a very crucial step because at this point you could get overwhelmed or confused but remember to take it one step at a time.
With your list of technologies and data pipeline ideas in hand, it’s time to start exploring and learning about them. At this stage, you need to understand, at a high level, what each of these tools is used for in the context of ETL operations.
I always recommend learning these technologies individually before incorporating them into your pipelines. Outlined below are the courses and documentation I used to understand some in-demand technologies:
- Series on Introduction to Kafka
- Free Snowflake tutorial on Udemy
- MongoDB tutorial
- My Introductory Tutorial to Terraform
I did not learn all of these at once because I do not have ten heads :) Instead, for each personal project I wanted to work on, I would pick one or two technologies from my checklist, learn about them separately, and then figure out how to integrate them into my pipelines. This gave me a good sense of direction.
Step 5: Build personal projects and make a whole lot of noise about it
When building your personal project you have to make sure you cover bits and pieces from some important concepts such as:
- Data Extraction/Ingestion (Scrapy, Requests, Selenium, AWS Glue)
- Data Transformation (PySpark, dbt, Pandas)
- Data Loading/Warehousing (BigQuery, Snowflake, S3, Azure Blob Storage, AWS Athena, Redshift)
- Data Streaming (Kafka, Spark Structured Streaming, AWS Kinesis)
- Data Orchestration tools (Airflow, Dagster, Azure Data Factory)
- CI/CD (Git, GitLab, Bash)
- Deployments (AWS SAM, AWS Lambda, Docker, Terraform, EC2, Databricks)
- Cloud service providers (AWS, Azure, GCP)
Of course, you do not have to include all these concepts in one project.
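Before reaching for any particular tool, it helps to see how a few of these concepts slot together in the smallest possible pipeline. The sketch below is purely illustrative: the “source” is a hard-coded list standing in for an API call or file read, and the “warehouse” is just a Python list, but the extract/transform/load split is the same one an orchestrator like Airflow would schedule as separate tasks.

```python
# A toy batch pipeline: each stage is a plain function, which mirrors how
# orchestration tools want your work split up (one task per stage).

def extract():
    # Stand-in for requests.get(...) or reading from S3/Blob Storage.
    return [
        {"user": "ada", "amount": "19.99", "status": "paid"},
        {"user": "bob", "amount": "bad-data", "status": "paid"},
        {"user": "cyd", "amount": "5.00", "status": "refunded"},
    ]

def transform(records):
    # Keep only valid, paid records and cast amounts to floats.
    clean = []
    for r in records:
        try:
            amount = float(r["amount"])
        except ValueError:
            continue  # drop malformed rows (log them in a real pipeline)
        if r["status"] == "paid":
            clean.append({"user": r["user"], "amount": amount})
    return clean

def load(records, target):
    # Stand-in for writing to a warehouse table.
    target.extend(records)
    return len(records)

warehouse = []
loaded = load(transform(extract()), warehouse)
print(f"loaded {loaded} rows")  # → loaded 1 rows
```

Once a pipeline is shaped like this, swapping the toy pieces for real ones (Requests for extraction, PySpark or Pandas for transformation, Snowflake or BigQuery for loading) is mostly a matter of replacing one function body at a time.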
My first personal project came from a course I took on Udemy. It introduced me to the Azure cloud environment, Databricks, PySpark, Azure Blob Storage, and Azure Data Factory. The project repository can be found here. Before I took this course, I read the overview to check that the technologies used matched those on my list. This is vital, as this one project enabled me to work with three or four tools while covering key concepts in Data Engineering.
When you sign up for Azure, you are automatically awarded $200 in free credit for one month, which is enough to carry out a project like this.
After this, I followed some tutorials created by Darshil Parmar on YouTube and tweaked them a bit to make them my own. It is always a good learning experience when you add extra spice to tutorials created by someone else. Your spice could be a different form of extraction, an automatic data loading system, or a containerized environment with Docker or Kubernetes, and so on. Here are two tutorials I utilized to create my projects:
- 📈 Stock Market Real-Time Data Analysis Using Kafka | End-To-End Data Engineering Project
- Twitter Data Pipeline using Airflow for Beginners | Data Engineering Project
You can find my renditions inspired by these tutorials here:
- Real-time end-to-end Pipeline Using AWS and Kafka
- Data Pipeline Orchestration using Airflow, AWS, and Snowflake
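One of the simplest “spices” mentioned above is containerizing your pipeline with Docker. Below is a minimal Dockerfile sketch; the file names (`requirements.txt`, `pipeline.py`) are placeholders for whatever your own project uses, not part of any of the tutorials.

```dockerfile
# Minimal sketch of containerizing a Python pipeline.
# requirements.txt and pipeline.py are placeholder names for your project.
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first so Docker can cache this layer between builds.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the rest of the project code.
COPY . .

# Run the pipeline once; an orchestrator or cron job would trigger the container.
CMD ["python", "pipeline.py"]
```

Packaging a tutorial project this way is a small change, but it makes the project reproducible on any machine and gives you something concrete to say about deployments in interviews.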
No doubt, as you work on these personal projects, errors and blockers will rear their ugly heads, but don’t throw in the towel! See it through by reading documentation, asking questions, using AI tools like ChatGPT, and, most importantly, googling effectively!
To learn and assimilate better, I keep a notebook where I write about everything I study and every project I work on. I document the drawbacks of my pipelines, recommendations, etc. I also make sure my projects are hosted on GitHub for future reference.
Don’t overlook the importance of sharing your project progress, especially on LinkedIn! This is a great way to get noticed and establish your identity as a Data Engineer. It also helps upcoming Engineers just as I have helped you :)
Step 6: Get that offer: interview preps, apply for internships/junior roles
Now this is where you are ready for real-world experience. I promise you, your projects and your portfolio will speak for you at this point, and if you are as lucky as I was, a recruiter’s message can land you your first gig! If you ask me, this is the fastest way to get a job these days.
Before then, keep practising interview questions, keep applying, keep learning new things, and reach out to connections on LinkedIn for internship or junior opportunities. Posting related topics on your socials will build a level of engagement that increases your chances of getting noticed for a job offer. It might take a while and might be very frustrating, but please don’t stop! And when you get that offer, pace yourself for a whole new learning journey!
I really hope this has provided you with some clarity and a roadmap. You can view and engage with all my personal projects here and do well to follow me on LinkedIn because I’m something of a Data Engineering Influencer myself :)