Categories apache-spark

6 posts

Setting Your Seed Value with Sparklyr

A few months ago, I gave a rather ambitious presentation on reproducible research. Part of this presentation included demonstrating how to analyze data on a Spark cluster using R with the sparklyr package. When you train as a data scientist, one of the first things you learn is to set a […]

Apache Spark Streaming Simplified

A typical Spark Streaming data pipeline. The above data flow depicts a typical streaming data pipeline used for streaming data analytics. OK, let's split it up. You need a source, and in this example I will use a delimited file as a source for the Kafka topic. There are multiple ways […]

Analyzing Medium’s posts and building a simple prediction service for “Popular on Medium”

These days Medium is all the rage. Blogging has taken the form of stories with more focus on a personal touch. Medium has become the platform to express your views and share them with the worldwide community. I decided to do a quick analysis on the articles and see what […]

Becoming a Data Engineer

As I explained in previous articles, I am a Software Engineer working at HelloFresh. I have been with the company for five years so far. Both before HelloFresh and at this company, I have held different roles. I started my career as a web developer doing everything, something that nowadays is called […]

Machine Learning @ Teads (part 2)

Stack, production workflow and practice. The first step [1] is the actual service making predictions. The generated application logs are then used, together with other sources of data (DMPs, etc.), to build training sets [2]. For the training jobs [3], these data sets are randomly split into several partitions that […]