How to become a Big Data Analyst

Anyone who works in the tech industry is aware of the rising demand for Analytics and Machine Learning professionals. More and more organisations are jumping on the data-driven decision-making bandwagon and accumulating large volumes of data about their business. To make sense of all that data, organisations need Big Data Analysts.

Data Analysts have traditionally worked with pre-formatted data served by IT departments. But with the need for real-time or near-real-time Analytics to serve end customers better, analysis has to be performed faster, which makes the dependency on IT departments a bottleneck. Analysts are now required to understand the influx of data themselves: data streams that ingest millions of records into databases or file systems, the Lambda architecture, and batch processing.

Analysing larger amounts of data also requires skills that range from understanding the business, the market and the competitors to a wide range of technical skills in data extraction, data cleaning and transformation, data modelling and statistical methods.

Analytics, being a relatively new field, is struggling to meet market demand for highly skilled Big Data Analysts. Being a Big Data Analyst requires a thorough understanding of data architecture and of the data flow from source systems into the big data platform. One can always stick to a specific industry domain and specialise within it, for example Healthcare Analytics, Marketing Analytics, Financial Analytics, Operations Analytics, People Analytics or Gaming Analytics. But mastering the end-to-end data chain can lead to plenty of opportunities, irrespective of industry domain.

The entire Data and Analytics suite covers the following stages and tools:

  • Data integrations – connecting disparate data sources
  • Data security and governance – ensuring data integrity and access rights
  • Master data management – ensuring consistency and uniformity of data
  • Data Extraction, Transformation and Loading – making raw data business user friendly
  • Hadoop and HDFS – big data storage mechanisms
  • SQL/ Hive / Pig – data query languages
  • R/ Python – programming languages for data analysis and mining
  • Data science algorithms such as Naive Bayes, K-means and AdaBoost – machine learning algorithms for classification and clustering
  • Data Architecture – combining all of the above in an optimised way to deliver business insights
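
To make the Extraction, Transformation and Loading stage concrete, here is a minimal sketch using only the Python standard library. The CSV content, column names and the `sales` table are made up for illustration, and an in-memory SQLite database stands in for whatever warehouse or big data store an organisation actually uses.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system (inline so the sketch runs as-is).
RAW_CSV = """customer_id,country,amount
1, se ,120.50
2,DK,
3,se,80.00
"""


def extract(text):
    """Extract: read raw rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalise country codes and drop rows without an amount."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip incomplete records
        cleaned.append(
            (int(row["customer_id"]), row["country"].strip().upper(), float(amount))
        )
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into a queryable table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(customer_id INTEGER, country TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for a real data store
load(transform(extract(RAW_CSV)), conn)

# Once loaded, the data can be queried with plain SQL.
totals = conn.execute(
    "SELECT country, SUM(amount) FROM sales GROUP BY country ORDER BY country"
).fetchall()
print(totals)  # [('SE', 200.5)]
```

The same three-step shape applies at scale; only the tools behind `extract` and `load` change.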

The new-age data analyst, a versatile Big Data Analyst, is one who understands the complexity of data integrations using APIs, connectors or ETL (Extraction, Transformation and Loading); designs data flows from disparate systems with data security and quality in mind; can code in SQL or Hive and in R or Python; is well acquainted with machine learning algorithms; and has a knack for understanding business complexities.

Since Big Data and Analytics is constantly evolving, it is imperative for anyone aiming at a career in the field to stay well versed in the latest tech stack and architectural breakthroughs. Some ways of doing so:

  • Following knowledgeable industry leaders or big data thought leaders on Twitter
  • Joining Big Data related groups on LinkedIn
  • Following Big Data influencers on LinkedIn
  • Attending events, conferences and seminars on Big Data
  • Connecting with peers within the Big Data industry
  • Last but not least (and probably most important), enrolling in MOOCs (Massive Open Online Courses) and/or reading Big Data books

Since Analytics is a vast field encompassing several operations, one could choose to specialise in parts of the Analytics chain: data engineers specialise in highly scalable data management systems, data scientists in machine learning algorithms, and data architects in the overall data integrations, data flow and storage mechanisms. But in order to excel and future-proof a career in the world of Big Data, one needs to master more than one area. A data analyst who is acquainted with all the steps involved in data analysis, from data extraction to insights, is an asset to any organisation and will be much sought after!

Recommendation Systems

Recommendation systems have changed the way people shop online, find books, movies or music, discover news articles that go viral, and find friends and workmates on LinkedIn. They analyse browsing patterns on websites, ratings, the most popular items at that point in time, or the products saved in one's virtual basket to recommend products. Similarly, common interests, work skills or geographical locations are used to predict people you might want to connect with on social media sites.

Behind such personalised recommendation systems lie big data platforms, including software, hardware and algorithms, that analyse customer behaviour and push recommended products in real time. These platforms handle the distribution and computation of both data and event data. Data can pertain to how the customer in question, or similar customers, have rated products in the past, while event data could be tracked mouse clicks that trigger events such as viewing a product. Sometimes both need to be combined to predict a customer’s choice. Hence, the recommendation system architecture caters to data storage for offline analysis, to low-latency computation, and to a combination of the two.

The data platform architecture needs to be robust enough to ingest continuous real-time data streams into scalable systems such as Apache HBase on Hadoop, or another big data storage infrastructure such as Amazon Redshift. Apache Kafka is usually used as the messaging system for the real-time data stream, in combination with Apache Storm for stream processing. Because of the high throughput, data redundancy needs to be in place to cope with failures. If the real-time computation needs to take into account customer data such as purchase history, preferences, products already bought, segmentation based on socio-economic demographics, or data from ERP and CRM systems, there are two options: either all of those systems have to be available online so the data can be blended in real time, or the customer detail data can be combined offline into a Single Customer View and queried together with the real-time event data.
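
The second option, querying an offline Single Customer View together with real-time events, can be sketched roughly as follows. This is an illustration under stated assumptions, not a production design: the `SINGLE_CUSTOMER_VIEW` dictionary, its fields and the event format are all hypothetical, and a plain Python list stands in for a message stream such as a Kafka topic.

```python
# Hypothetical Single Customer View, assumed to be built offline from
# CRM, ERP and purchase-history data and keyed by customer id.
SINGLE_CUSTOMER_VIEW = {
    "c1": {"segment": "premium", "purchased": {"p1", "p2"}},
    "c2": {"segment": "budget", "purchased": set()},
}

# Fallback profile for customers that are not yet in the offline view.
UNKNOWN_PROFILE = {"segment": "unknown", "purchased": set()}


def enrich(event, customer_view):
    """Attach offline customer attributes to a single real-time event."""
    profile = customer_view.get(event["customer_id"], UNKNOWN_PROFILE)
    return {
        **event,
        "segment": profile["segment"],
        "already_bought": event["product_id"] in profile["purchased"],
    }


# Stand-in for a real-time event stream (hypothetical event format).
events = [
    {"customer_id": "c1", "type": "view", "product_id": "p2"},
    {"customer_id": "c2", "type": "view", "product_id": "p2"},
    {"customer_id": "c9", "type": "view", "product_id": "p1"},
]

enriched = [enrich(event, SINGLE_CUSTOMER_VIEW) for event in events]
for row in enriched:
    print(row["customer_id"], row["segment"], row["already_bought"])
```

Because the lookup is a single keyed read, the expensive blending of source systems stays offline and the real-time path remains cheap.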

The valuable assets of any organisation are customers, products and, now, data. Machine learning algorithms combine the three to deliver business gains, and predictive analytics is imperative for being proactive about customer needs. Some of the techniques used for recommendation engines are content-based filtering, collaborative filtering, dimensionality reduction, k-means clustering and matrix factorisation. With highly scalable data storage platforms widely available, the challenge is not data storage but the speed with which the data needs to be analysed. A practical approach is to combine mostly precomputed data with fresh event data, using pre-modelled algorithms, to push personalised recommendations to the customer interface.
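
As a rough illustration of combining precomputed data with fresh event data, the sketch below uses item-based collaborative filtering with cosine similarity, one of several techniques that could fill this role. The `RATINGS` data and all function names are invented for the example. The similarity table is computed once offline, and the online step only sums lookups for the items a customer has just viewed.

```python
import math
from collections import defaultdict

# Hypothetical historical ratings: customer -> {product: rating}.
RATINGS = {
    "alice": {"book": 5, "film": 4},
    "bob": {"book": 4, "film": 5, "album": 2},
    "carol": {"film": 4, "album": 5},
}


def item_vectors(ratings):
    """Pivot customer->item ratings into item->customer rating vectors."""
    vectors = defaultdict(dict)
    for customer, items in ratings.items():
        for item, rating in items.items():
            vectors[item][customer] = rating
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse rating vectors."""
    shared = set(a) & set(b)
    dot = sum(a[c] * b[c] for c in shared)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def precompute_similarities(ratings):
    """Offline step: item-to-item similarity table."""
    vectors = item_vectors(ratings)
    return {
        i: {j: cosine(vectors[i], vectors[j]) for j in vectors if j != i}
        for i in vectors
    }


def recommend(recent_items, similarities, top_n=2):
    """Online step: rank candidates by summed similarity to fresh events."""
    scores = defaultdict(float)
    for item in recent_items:
        for other, sim in similarities.get(item, {}).items():
            if other not in recent_items:
                scores[other] += sim
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


SIMILARITIES = precompute_similarities(RATINGS)  # batch job, run periodically
print(recommend(["book"], SIMILARITIES))  # ['film', 'album']
```

Here a customer who has just viewed `book` is shown `film` first, because the customers who rated both gave them similarly high ratings.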