Ease of structured data and efficiency of Spark

Photo by Sorin Sîrbu on Unsplash

Shilpa has become an expert in Spark and enjoys Big Data analysis. Everything was going well until her employer wanted to know what kind of insight they could get by combining their enterprise data from the Oracle database with Big Data.

Oracle Database is among the most widely used enterprise databases. Many enterprise applications, such as ERP and SCM applications, run on Oracle Database. Like Shilpa, many data scientists come across situations where they have to relate data from enterprise databases like Oracle with data from a Big Data source like Hadoop.

There are two approaches to address such…


Step-by-step instructions on downloading, installing, configuring, and running a fully functional Oracle Database on an AWS EC2 instance.

Photo by Renny Chan on Unsplash

Sam was thrilled to get an employment offer as a data engineer at a startup. On his first day in the office, he was in for a surprise: the startup founders didn’t know the difference between a data engineer and a database administrator, and Sam was expected to install a fully functional database on an AWS EC2 instance.

If you are in a similar situation, or are a rookie DBA and want a detailed step-by-step process for installing a fully functional server-class database on an AWS EC2 instance, this article is for you. In this article, I am going to take you through the installation of a fully functional server-class Oracle Database 12c on an AWS EC2 instance. …


From understanding core concepts to developing a well-functioning Spark application on an AWS instance!

Photo by AJ Garcia on Unsplash

Shilpa, a rookie data scientist, was in love with her first job with a budding startup: an AI-based Fintech innovation hub. While the startup started with a traditional single-machine, vertically scaled infrastructure, it soon realized that the volume and velocity of the data it dealt with needed a Big Data solution. While Hadoop was a good choice for data storage, Spark, with its in-memory computation capability and real-time data streaming, became the obvious choice as a data execution engine.

Shilpa had a steep learning curve to conquer. While she understood data and was a Python expert, Spark was new to her and she had little time to master the jungle of knowledge available in bits and pieces on the internet: Big Data, Hadoop Distributed File System (HDFS), Spark, Spark Context, Driver, Executor, Python, Java Virtual Machine (JVM), Compute Cluster, Yet Another Resource Negotiator (YARN). …


Rendezvous of Python, SQL, Spark, and Distributed Computing making Machine Learning on Big Data possible

Photo by Ben Weber on Unsplash

Shilpa was immensely happy with her first job as a data scientist in a promising startup. She was in love with scikit-learn and especially Pandas. It was fun to perform data exploration using the Pandas dataframe. A SQL-like interface with quick in-memory data processing was more than a budding data scientist could ask for.

As the startup matured, so did the amount of data, and the chase began to enhance its IT systems with a bigger database and more processing power. Shilpa also added parallelism through session pooling and multi-threading in her Python-based ML programs; however, it was never enough. Soon, IT realized they could not keep adding more disk space and more memory, and decided to go for distributed computing (a.k.a. …


A combination no data scientist can ignore

Photo by bruce mars on Unsplash

While learning data science, most of the time we use open-source data in Excel or CSV formats. However, real-world data science projects involve accessing data from databases. If data needs to be stored and accessed, it needs a database. The two major types of modern databases are relational (RDBMS) and non-relational (also called NoSQL) databases.

An RDBMS works well with data that can be stored in rows and columns. Consistency and speed are its greatest strengths.

In this article, I will take you through one of the most used RDBMSs, Oracle Database, and explain how you can use the holy combination of a Python application on the Oracle database. …


Best of all possible worlds for Data Science and Machine Learning professionals

Photo by Zeshalyn Capindo on Unsplash

Most data science and machine learning professionals are in love with the Jupyter notebook. Barring a few exceptions, most data science and machine learning work happens in the cloud. So there is little option but to use Jupyter on the cloud.

In my previous article, I explained how you can download the Jupyter notebook on an AWS Linux instance and access it on your client machine using Xming and X11 forwarding. Below is the link to that article.

In this article, I am explaining how to access the Jupyter notebook on an AWS EC2 instance directly through a Docker image. …


A great mix of ubiquitous cloud and usability of Jupyter

Photo by Robert Collins on Unsplash

Jupyter is one of the most used development tools among data science and machine learning professionals. Considering that most data science and machine learning projects involve working with data on the cloud, it makes sense to use the Jupyter notebook on the cloud. The flexibility of Jupyter on a ubiquitous cloud instance is a win-win combination.

In this article, I will take you through the steps to install and run a Jupyter notebook on a Linux AWS EC2 instance.

Prerequisite Tools

I am assuming that you already have an AWS EC2 instance and know how to access it through PuTTY. …


Imperative, Functional, Object-Oriented: Which programming paradigm is a good fit for my job?

Photo by JESHOOTS.COM on Unsplash

Python programming offers endless possibilities. Web and internet development, desktop GUI applications, machine learning, deep learning, software development, business application development: the list of Python use cases is long. With so much to learn, it’s easy to get lost and wonder how much is too much.

In this article, I will take you through the different programming paradigms Python supports and help you decide which one you should learn first.

A programming paradigm is a style of programming, and a way to classify programming languages.

Python supports three programming paradigms: Imperative, Functional, and Object-Oriented. …
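To make the three paradigms concrete, here is a minimal sketch that solves the same small task (summing the squares of a list) in each style. The task, the variable names, and the `SquareSummer` class are hypothetical and chosen only for illustration; they are not taken from the article.

```python
from functools import reduce

numbers = [1, 2, 3, 4]  # illustrative input, not from the article

# Imperative: spell out each step and update a running total.
total_imperative = 0
for n in numbers:
    total_imperative += n * n

# Functional: combine functions and avoid mutable state.
total_functional = reduce(lambda acc, n: acc + n * n, numbers, 0)


# Object-oriented: bundle the data and the behavior in a class.
class SquareSummer:  # hypothetical class name
    def __init__(self, values):
        self.values = list(values)

    def total(self):
        return sum(v * v for v in self.values)


total_oop = SquareSummer(numbers).total()

print(total_imperative, total_functional, total_oop)  # 30 30 30
```

All three versions produce the same result; what differs is how the program is organized, which is what a paradigm describes.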


It’s easier than you thought!

Photo by Hitesh Choudhary on Unsplash

In an object-oriented programming (OOP) language like Python, classes, methods, instances, and objects play a critical role. In this article, I will take you through the different method types used in Python OOP.

Figure 1 explains instance methods, class methods, and static methods in Python. While the instance method is the most common method type, class methods and static methods are defined with decorators (note the @ sign).
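The three method types can be sketched in a few lines of Python. This is an illustrative example only, not a reproduction of the article's figure; the `Employee` class, its attributes, and its method names are hypothetical.

```python
class Employee:  # hypothetical class for illustration
    bonus = 100  # class attribute shared by every instance

    def __init__(self, name, salary):
        self.name = name
        self.salary = salary

    # Instance method: receives the instance as `self`.
    def salary_with_bonus(self):
        return self.salary + self.bonus

    # Class method: receives the class as `cls`; often used as an
    # alternative constructor.
    @classmethod
    def from_string(cls, text):
        name, salary = text.split(",")
        return cls(name.strip(), int(salary))

    # Static method: receives neither `self` nor `cls`; a plain
    # function kept in the class namespace.
    @staticmethod
    def is_valid_salary(value):
        return value > 0


employee = Employee.from_string("Sam, 1000")
print(employee.name, employee.salary_with_bonus())  # Sam 1100
print(Employee.is_valid_salary(-5))                 # False
```

The instance method needs an object to work on, the class method works on the class itself, and the static method depends on neither.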


And why data scientists cannot escape it

Photo by Franki Chamaki on Unsplash

For most of us, the most dreaded part of data science and machine learning is the math and statistics involved.

If you’re a scientist, and you have to have an answer, even in the absence of data, you’re not going to be a good scientist.

- Neil deGrasse Tyson

Everyone has their own way of developing a love for data and data science. For me, understanding the basics worked like magic. Once I mastered the basic concepts like types of data, distribution, and shape of distributions, etc., …

About

Sanjay Singh

A curious IT professional with two decades of experience in Data Science, Machine Learning, Enterprise Resource Planning, and Relational Databases!
