Tuesday, December 27, 2016

A Data Engineer's Guide To Non-Traditional Data Storages

Data Engineering

With the rise of big data and data science, many engineering roles are being challenged and expanded. One new-age role is data engineering.
Originally, the purpose of data engineering was the loading of external data sources and the designing of databases (designing and developing pipelines to collect, manipulate, store, and analyze data).
It has since grown to support the volume and complexity of big data. Data engineering now encapsulates a wide range of skills, from web crawling and data cleansing to distributed computing and data storage and retrieval.
For data engineering and data engineers, data storage and retrieval is the critical component of the pipeline together with how the data can be used and analyzed.
In recent times, many new and different data storage technologies have emerged. However, which one is best suited and has the most appropriate features for data engineering?
Most engineers are familiar with SQL databases, such as PostgreSQL, MSSQL, and MySQL, which are structured in relational data tables with row-oriented storage.
Given how ubiquitous these databases are, we won’t discuss them today. Instead, we explore three types of alternative data storages that are growing in popularity and that have introduced different approaches to dealing with data.
Within the context of data engineering, these technologies are search engines, document stores, and columnar stores.
  • Search engines excel at text queries. When compared to text matches in SQL databases, such as LIKE, search engines offer richer query capabilities and better performance out of the box.
  • Document stores provide better data schema adaptability than traditional databases. By storing the data as individual document objects, often represented as JSON, they do not require a predefined schema.
  • Columnar stores specialize in single column queries and value aggregations. SQL operations, such as SUM and AVG, are considerably faster in columnar stores, as data of the same column are stored closer together on the hard drive.
In this article, we explore all three technologies: Elasticsearch as a search engine, MongoDB as a document store, and Amazon Redshift as a columnar store.
By understanding alternative data storage, we can choose the most suitable one for each situation.
Storage for Data Engineering: Which is the Best?
For data engineers, the most important aspects of data storages are how they index, shard, and aggregate data, so we'll compare the three technologies along those dimensions.
Each data indexing strategy improves certain queries while hindering others.
Knowing which queries are used most often can influence which data store to adopt.
Sharding, a methodology by which a database divides its data into chunks, determines how the infrastructure will grow as more data is ingested.
Choosing one that matches our growth plan and budget is critical.
Finally, each of these technologies aggregates its data very differently.
When we are dealing with gigabytes and terabytes of data, the wrong aggregation strategy can limit the types and performance of the reports we can generate.
As data engineers, we must consider all three aspects when evaluating different data storages.

Contenders

Search Engine: Elasticsearch

Elasticsearch quickly gained popularity among its peers for its scalability and ease of integration. Built on top of Apache Lucene, it offers powerful, out-of-the-box text search and indexing functionality. Aside from the traditional search engine tasks of text search and exact-value queries, Elasticsearch also offers layered aggregation capabilities.

Document Store: MongoDB

At this point, MongoDB can be considered the go-to NoSQL database. Its ease of use and flexibility quickly earned it popularity. MongoDB supports rich and adaptable querying for digging into complex documents. Often-queried fields can be sped up through indexing, and when aggregating a large chunk of data, MongoDB offers a multi-stage pipeline.

Columnar Store: Amazon Redshift

Alongside the growth of NoSQL’s popularity, columnar databases have also gathered attention, especially for data analytics. By storing data in columns instead of the usual rows, aggregation operations can be executed directly from the disk, greatly increasing performance. A few years ago, Amazon rolled out its hosted service for a columnar store called Redshift.
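To make the row-versus-column idea concrete, here is a minimal Python sketch. It is only an in-memory illustration, not how Redshift lays out data on disk, and the table and field names (user, country, spend) are hypothetical.

```python
# Hypothetical rows; the field names are made up for illustration.
rows = [
    {"user": "ann", "country": "US", "spend": 12.5},
    {"user": "bob", "country": "DE", "spend": 7.0},
    {"user": "cyd", "country": "US", "spend": 30.5},
]

def sum_spend_row_store(records):
    # Row-oriented layout: each record is kept together, so summing one
    # field still walks every full record.
    return sum(record["spend"] for record in records)

# Column-oriented layout: all values of one column sit next to each other.
columns = {name: [row[name] for row in rows] for name in rows[0]}

def sum_spend_column_store(cols):
    # Only the "spend" column is touched; the other columns are never read.
    return sum(cols["spend"])

print(sum_spend_row_store(rows), sum_spend_column_store(columns))
```

Both functions return the same total; the difference is how much unrelated data has to be scanned to get there, which is what matters once the table no longer fits in memory.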

Indexing

Elasticsearch’s Indexing Capability

In many ways, search engines are data stores that specialize in indexing texts.
While other data stores create indices based on the exact values of the field, search engines allow retrieval with only a fragment of the (usually text) field.
By default, this retrieval is done automatically for every field through analyzers.
An analyzer is a module that creates multiple index keys by evaluating the field values and breaking them down into smaller values.
For example, a basic analyzer might break “the quick brown fox jumped over the lazy dog” into words, such as “the,” “quick,” “brown,” “fox,” and so on.
This method enables users to find data by searching for fragments, with the results ranked by how many fragments match the same document.
A more sophisticated analyzer could utilize edit distances and n-grams, and filter by stopwords, to build a comprehensive retrieval index.
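The analyzer idea can be sketched in a few lines of Python. This is a toy illustration under simple assumptions (whitespace tokenization, a tiny hand-picked stopword list, ranking by match count), not Elasticsearch's or Lucene's actual implementation; all function names are hypothetical.

```python
from collections import Counter, defaultdict

STOPWORDS = {"the", "over", "a", "an"}  # tiny illustrative list

def analyze(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [tok for tok in text.lower().split() if tok not in STOPWORDS]

def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Rank documents by how many query tokens they match."""
    hits = Counter()
    for token in analyze(query):
        for doc_id in index.get(token, ()):
            hits[doc_id] += 1
    return [doc_id for doc_id, _ in hits.most_common()]

docs = {
    1: "the quick brown fox jumped over the lazy dog",
    2: "a lazy afternoon",
    3: "quick brown bread recipe",
}
index = build_index(docs)
print(search(index, "quick lazy fox"))
```

Document 1 matches all three query fragments, so it ranks first; the other two documents each match a single fragment.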

MongoDB’s Indexing Capability

As a generic data store, MongoDB has a lot of flexibility for indexing data.
Unlike Elasticsearch, it only indexes the _id field by default, and we need to create indices for the commonly queried fields manually.
Compared to Elasticsearch, MongoDB’s text analyzer isn’t as powerful. But it does provide a lot of flexibility with indexing methods, from compound and geospatial indices for optimal querying to TTL and sparse indices for storage reduction.

Redshift’s Indexing Capability

Unlike Elasticsearch, MongoDB, or even traditional databases, including PostgreSQL, Amazon Redshift does not support an indexing method.
Instead, it reduces its query time by maintaining a consistent sorting on the disk.
As users, we can configure an ordered set of column values as the table sort key. With the data sorted on the disk, Redshift can skip an entire block during retrieval if its value falls outside the queried range, heavily boosting performance.
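The block-skipping behavior can be illustrated with a small Python sketch. It assumes each block records the minimum and maximum of the values it holds; the block size and helper names are hypothetical, and this is not Redshift's internal format.

```python
def make_blocks(sorted_values, block_size):
    """Split a sorted column into blocks that carry min/max metadata."""
    blocks = []
    for start in range(0, len(sorted_values), block_size):
        chunk = sorted_values[start:start + block_size]
        blocks.append({"min": chunk[0], "max": chunk[-1], "values": chunk})
    return blocks

def range_query(blocks, low, high):
    """Return matching values and how many blocks were actually read."""
    matches, blocks_read = [], 0
    for block in blocks:
        # Skip the block if its value range cannot overlap [low, high].
        if block["max"] < low or block["min"] > high:
            continue
        blocks_read += 1
        matches.extend(v for v in block["values"] if low <= v <= high)
    return matches, blocks_read

timestamps = list(range(100))         # already sorted, like a sort key column
blocks = make_blocks(timestamps, 10)  # 10 blocks of 10 values each
matches, blocks_read = range_query(blocks, 42, 57)
print(len(matches), "rows from", blocks_read, "of", len(blocks), "blocks")
```

Here only 2 of the 10 blocks are read. If the column were unsorted, every block's min/max range would likely overlap the query and nothing could be skipped.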

Sharding

Elasticsearch’s Sharding Capability

Elasticsearch was built on top of Lucene to scale horizontally and be production ready.
Scaling is done by creating multiple Lucene instances (shards) and distributing them across multiple nodes (servers) within a cluster.
By default, each document is routed to its respective shard through its _id field.
During retrieval, the node that receives the request sends each shard a copy of the query before finally aggregating and ranking the results for output.
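The routing and scatter-gather steps can be sketched as follows. This is a simplified in-memory model: the shard count is hypothetical, crc32 is a stand-in hash chosen for the sketch (not the hash Elasticsearch uses), and the scoring is a plain term count.

```python
import zlib

NUM_SHARDS = 3  # hypothetical shard count

def shard_for(doc_id):
    # crc32 is a stand-in hash for this sketch, not Elasticsearch's own.
    return zlib.crc32(str(doc_id).encode("utf-8")) % NUM_SHARDS

shards = [dict() for _ in range(NUM_SHARDS)]

def index_document(doc_id, text):
    """Route the document to one shard based on its id."""
    shards[shard_for(doc_id)][doc_id] = text

def search(term, size=10):
    # Scatter: every shard scores its own documents for the term.
    partial = []
    for shard in shards:
        for doc_id, text in shard.items():
            score = text.lower().split().count(term.lower())
            if score:
                partial.append((score, doc_id))
    # Gather: merge the per-shard hits and rank them globally.
    partial.sort(key=lambda hit: (-hit[0], hit[1]))
    return [doc_id for _, doc_id in partial[:size]]

for i, text in enumerate(["red fox", "fox fox den", "blue whale", "fox trot"]):
    index_document(f"doc-{i}", text)
print(search("fox"))
```

Because the shard is derived from the id alone, the same document always lands on the same shard, while a search has to visit every shard because any of them may hold a match.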

MongoDB’s Sharding Capability

Within a MongoDB cluster, there are three types of servers: router, config, and shard.
By scaling the router servers, the cluster can accept more requests, but the heavy lifting happens at the shard servers.
As with Elasticsearch, MongoDB documents are routed (by default) via _id to their respective shards. At query time, the config server notifies the router which shards to query, and the router server then distributes the query and aggregates the results.
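As a rough sketch of that division of labor, the snippet below models the config server's knowledge as a small map from key ranges to shards and lets a router target only the shards that can hold matching keys. The chunk boundaries, shard names, and helper functions are all hypothetical, and real MongoDB clusters manage chunks far more dynamically.

```python
import bisect

# Hypothetical chunk map, as a config server might hold it: each chunk covers
# keys from its lower bound up to (not including) the next lower bound.
CHUNK_LOWER_BOUNDS = [0, 1000, 2000]
CHUNK_SHARDS = ["shard-a", "shard-b", "shard-c"]

def shards_for_range(low, high):
    """Router step: find which shards own keys in the inclusive range [low, high]."""
    first = bisect.bisect_right(CHUNK_LOWER_BOUNDS, low) - 1
    last = bisect.bisect_right(CHUNK_LOWER_BOUNDS, high) - 1
    return CHUNK_SHARDS[max(first, 0):last + 1]

def routed_count(data_by_shard, low, high):
    """Send the query only to the targeted shards, then merge the partial counts."""
    total = 0
    for shard in shards_for_range(low, high):
        total += sum(1 for key in data_by_shard[shard] if low <= key <= high)
    return total

data_by_shard = {
    "shard-a": [5, 250, 999],
    "shard-b": [1000, 1500],
    "shard-c": [2000, 4200],
}
print(shards_for_range(900, 1200), routed_count(data_by_shard, 900, 1200))
```

A query on the range 900 to 1200 only touches the first two shards. A query that does not filter on the routing key would have to be broadcast to all of them.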

Redshift’s Sharding Capability

An Amazon Redshift cluster consists of one leader node and several compute nodes.
The leader node handles the compilation and distribution of queries as well as the aggregation of intermediate results.
Unlike MongoDB’s router servers, there is only a single leader node, and it can’t be scaled horizontally.
While this creates a bottleneck, it also allows efficient caching of compiled execution plans for popular queries.

Aggregating

Elasticsearch’s Aggregating Capability

Documents within Elasticsearch can be bucketed by exact, ranged, or even temporal and geolocation values.
These buckets can be further grouped into finer granularity through nested aggregation.
Metrics, including means and standard deviations, can be calculated for each layer, which provides the ability to calculate a hierarchy of analyses within a single query.
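The shape of such a layered result can be sketched in plain Python. This only mimics the output of a nested aggregation, with a mean metric at each layer; the documents and field names (country, year, followers) are hypothetical and no Elasticsearch API is involved.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical documents; the field names are made up for this sketch.
docs = [
    {"country": "US", "year": 2015, "followers": 10},
    {"country": "US", "year": 2015, "followers": 30},
    {"country": "US", "year": 2016, "followers": 50},
    {"country": "DE", "year": 2016, "followers": 20},
]

def nested_aggregation(documents):
    """Bucket by country, then by year, with a mean metric at each layer."""
    by_country = defaultdict(list)
    for doc in documents:
        by_country[doc["country"]].append(doc)

    result = {}
    for country, country_docs in by_country.items():
        by_year = defaultdict(list)
        for doc in country_docs:
            by_year[doc["year"]].append(doc["followers"])
        result[country] = {
            "avg_followers": mean(d["followers"] for d in country_docs),
            "by_year": {year: mean(vals) for year, vals in by_year.items()},
        }
    return result

report = nested_aggregation(docs)
print(report["US"])
```

A single call yields both the per-country average and the per-year averages within each country, which is the hierarchy of analyses the text describes.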
Being a document-based store, it is limited when it comes to intra-document field comparisons.
For example, while it is good at filtering if a field followers is greater than 10, we cannot check if followers is greater than another field following.
As an alternative, we can inject scripts as custom predicates. This feature is great for one-off analysis, but performance suffers in production.

MongoDB’s Aggregating Capability

The Aggregation Pipeline is powerful and fast.
As its name suggests, it operates on returned data in a stage-wise fashion.
Each step can filter, aggregate and transform the documents, introduce new metrics, or unwind previously aggregated groups.
Because these operations are done in a stage-wise manner, and by ensuring documents and fields are filtered down early to only what is needed, the memory cost can be minimized. Compared to Elasticsearch, and even Redshift, the Aggregation Pipeline is an extremely flexible way to view the data.
Despite its adaptability, MongoDB suffers the same lack of intra-document field comparison as Elasticsearch.
Furthermore, some operations, including $group, require the results to be passed to the master node.
Thus, they do not leverage the distributed computing.
Those unfamiliar with the stage-wise pipeline calculation will find certain tasks unintuitive. For example, summing up the number of elements in an array field would require two steps: first, the $unwind, and then the $group operation.
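The two-step pattern can be emulated in plain Python to show why both stages are needed. The functions below are hypothetical stand-ins for the $unwind and $group stages, not MongoDB driver calls, and the "tags" array field is made up.

```python
from collections import defaultdict

# Hypothetical documents with an array field called "tags".
docs = [
    {"_id": 1, "tags": ["a", "b", "c"]},
    {"_id": 2, "tags": ["a"]},
    {"_id": 3, "tags": []},
]

def unwind(documents, field):
    """Stage 1: emit one copy of the document per array element."""
    for doc in documents:
        for item in doc[field]:
            yield {**doc, field: item}

def group_count(documents, key):
    """Stage 2: group by key and count the documents in each group."""
    counts = defaultdict(int)
    for doc in documents:
        counts[doc[key]] += 1
    return dict(counts)

tag_counts = group_count(unwind(docs, "tags"), "_id")
print(tag_counts)
```

Note that the document with an empty array produces no unwound copies, so it disappears from the grouped result in this sketch rather than showing up with a count of zero.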

Redshift’s Aggregating Capability

The benefits of Amazon Redshift cannot be overstated.
Frustratingly slow aggregations on MongoDB while analyzing mobile traffic are quickly solved by Amazon Redshift.
Because Redshift supports SQL, traditional database engineers will have an easy time migrating their queries to it.
Onboarding time aside, SQL is a proven, scalable, and powerful query language, supporting intra-document/row field comparisons with ease. Amazon Redshift further improves its performance by compiling and caching popular queries executed on the compute nodes.
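The field comparison that is awkward in the document stores is a one-line predicate in SQL. The sketch below uses Python's built-in sqlite3 module purely as a stand-in SQL engine, since a Redshift cluster is not assumed to be available; the users table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, followers INTEGER, following INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [("ann", 120, 80), ("bob", 15, 300), ("cyd", 50, 50)],
)

# Compare two columns of the same row directly in the WHERE clause.
popular = conn.execute(
    "SELECT name FROM users WHERE followers > following ORDER BY name"
).fetchall()
print(popular)
conn.close()
```

Only the row whose followers value exceeds its following value is returned; no scripting or extra pipeline stage is needed.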
As a relational database, Amazon Redshift does not have the schema flexibility that MongoDB and Elasticsearch have. Optimized for read operations, it suffers performance hits during updates and deletes.
To maintain the best read time, the rows must be kept sorted, which adds extra operational effort.
Tailored to those with petabyte-sized problems, it is not cheap and likely not worth the investment unless there are scaling problems with other databases.

Picking the Winner

In this article, we examined three different technologies – Elasticsearch, MongoDB, and Amazon Redshift – within the context of data engineering. However, there is no clear winner as each of these technologies is a front-runner in its storage type category.
For data engineering, depending on the use case, some options are better than others.
  • MongoDB is a fantastic starter database. It provides the flexibility we want when the data schema is still to be determined. That said, MongoDB does not outperform other databases in the specific use cases they specialize in.
  • While Elasticsearch offers a similar fluid schema to MongoDB, it is optimized for multiple indices and text queries at the expense of write performance and storage size. Thus, we should consider migrating to Elasticsearch when we find ourselves maintaining numerous indices in MongoDB.
  • Redshift requires a predefined data schema, and is lacking the adaptability that MongoDB provides. In return, it outclasses other databases for queries only involving single (or a few) columns. When the budget permits, Amazon Redshift is a great secret weapon when others cannot handle the data quantity.
Article via Toptal

Thursday, December 15, 2016

The Zen of devRant

Highly annoying clients.
Non-technical family and friends.
Clueless recruiters.
You all know who I’m talking about – the client who wants you to code a website in GitHub; the partner who thinks your code looks like a bunch of sad winky faces; and the recruiters who want five years of Swift experience when Swift is only two years old.
For years, designers have been able to vent about Clients From Hell. Now, it’s developers’ turn to get the frustration off their chests on devRant. For those of you who’ve been living under a rock for the past year, devRant is where developers can, well, [anonymously] rant about all of the above.
Some posts will make you laugh. Others will make you laugh so hard you cry. And just about all of them will make you empathize with the poster.
This post is a compilation of our favorite devRants. We hope you enjoy them as much as we do.

Work

So you almost showed up to work with a positive attitude, but then PITA clients and bosses stepped in and turned that right around.
This section is an ode to those developers with likely the highest blood pressure. Because, hey, they deserve something for dealing with years’ worth of migraines.

Placebo Designs

By Akrion

Client tests the app and provides feedback ...

This sucks! Full of bugs, hard to navigate, nothing works!

1 week later after version 2 client provides new feedback: This rocks! Love it. Easy to use and rock solid!

Changes made: background set to light blue.

You Said my Website Would be SEO-friendly!

By Benline

Time for a rant!

Got a client I've just built a website for and they went live 2 weeks ago.

This morning he sends me an email saying that the website is not good enough because it's not making any sales or getting any traffic.

I send an email back asking if he has a marketing / SEO company... The response was I thought you do that as you said the site would be SEO friendly!!!

I'm a developer! Not a marketer, fuck off.

Code in GitHub

By saeedjassani

GitHub

Copycat

By Depe

A client wants to make a Pokemon GO type of game.. In two months.. (before the hype ended, they said)

Admit Nothing

By devRat

One of my clients (who's also a coder) asked me today
"Are you on devRant?"
Me: 😶 What's that? 😓

Swipe, Pinch, Landscape, Portrait, Back Pinch, Open New Tab, Close Tab, Ash Cigarette On Phone, Dunk In Toilet, Dry, Double Tap

By Rican2onylee

My boss literally spends half an hour finger-fucking his phone on the mobile site to find "bugs", that I can't replicate. A combination like: swipe, pinch, landscape, portrait, back pinch, open new tab, close tab, ash cigarette on phone, dunk in toilet, dry, double tap... Aha I've found a bug, there's 0.5 pixel line of space between the bag header and the browser bar.

That Time You Tried To Prank Someone, But They Don’t Even Notice

By Silhoutte

I've been slowly increasing the size of my tech manager's mouse cursor over the last month when he leaves his computer unlocked. It's about an inch tall now and he hasn't noticed yet. Everyone else in the office does and it's the best thing ever.

Like a Virgin

By Salimansari

Job offers be like:

We need a VIRGIN with at least 2 years EXPERIENCE of SEX…

You’re a Programmer, You Must Have Secret Backups

By DevRat

Client: Hi. my SEO guy messed up the website. It's kind of .... you know .... gone. You must have the backup. Please restore

Me (after 10 mins): Done

.............

Client: Hi again. I don't see my changes from yesterday. Why?

Me: Because I had 2 months old backup.

Client: Why?

Me: Because that's the last time I worked on your website. And you changed the credentials later on.

Client: But you're a programmer. You must have had a back door to take back ups.

........

Client: Hello?

Me: It's time to leave earth.

We’ll Give You The Details After You Give Us The Quote

By Beofett

We want a web site.

We're going to want lots of interactive content, which we'll define later.

You need to develop the whole thing in 2 weeks.

We'll give you all the details after you tell us exactly how much it will cost.

Keep Calm, And Remember There’s No WiFi In Jail.

By Daumie

Keep calm, and remember there’s no wifi in jail

And that was enough encouragement

Personal

After a long day’s work, you just want to relax and probably rant about all of the above. Unfortunately, that isn’t what typically happens, especially when you’re living with a non-coder who, as much as they try to, just can’t feel your pain.
In the words of devRant, developers are people too. Yet, all too often devs can feel like a one-man wolfpack.
This section is a collection of rants about the ones you love who just don’t get it.

Night Owl

By Zayy862

There's no greater waste of time than laying in bed with your significant other and waiting on them to fall asleep so you can tip toe back to your computer in order to hit a deadline.

Literally my ritual every night.

Why is Your Code so Sad?

By Tisaconundrum


Me: *coding*
Gf: *walks into room*
Gf: awww look at all the sad winky faces
Me: excuse me?
Gf: look at all the sad winky faces *points at this ); *
Me: ... 😕😂

At Least Everyone Isn’t an Idiot

By rigobertomolina

So I barely get home and I see my 10-year old sister in the living room coding with the Xcode Playground, I asked her where she learned how to do that and she said "I just read the books you had." I'm so proud. 😭

Yet Another Great Opportunity

By Mwemekamp

Some guy my girlfriend knows, heard I'm a software developer. He had this 'great' idea on how he wanted to start a new revolutionary way of paying on the internet. He wanted to create a service like paypal but without having the hassle of logging in first and going through a transaction. He wanted a literal "buy now" button on every major webshop on the internet. When I asked him how he thought that would work legally and security wise, he became a bit defensive and implied that since I'm the tech guy I should work out that kind of stuff. When the software was ready, he would have clients lined up for the service and his work would start.

I politely declined this great opportunity

Truths

Programming isn’t like a normal job. It’s almost invisible. Yet it’s behind just about every amazing thing being built today, which makes it an indispensable function that the majority of people don’t understand all that well.
In these rants, developers share some funny truths only they could appreciate.

These are my Confessions…

By Badboytherock

Dev confession.

Everybody in my department thinks I am a genius programmer. I am just a better googler who knows how to apply things.

That Time You Almost Donated a Website

By Bleestein

Ever visit a site/web app that's so bad you want to build them a new one for free?

Being Efficient is Stressful

By Azous

I'm so stressed lately that even when I try to relax I stress out because I keep trying to relax in the most efficient way possible... Fuck

Telecommuting Sucks

By GinjaNinja

You can work from anywhere... anywhere in the world!

Hmmm... Yeah, right! But not when management likes warm bodies at the office.

I hate, hate, absolutely HATE having to travel to work, spending at least 45min to an hour in traffic just to get to work! 😤😡 And then rinse and repeat to get home... which means I'm up at 5:30 every morning to be at work by 7:30, only to get home past 18:00 - traffic permitting! *sigh* 😩

Have You Heard of Google?

By Stoner

Worst part: being everyone else's Search Bitch. Seriously, how the hell do you have a job in the tech industry when you can't use a fucking search engine, whether it's Google, a builtin search facility or, hell, scrolling down the goddamn page?

Facepalm

Get ready for a double facepalm because one just won’t cut it.

Oh, Junior Devs!

By Mcarrera

On his first week at job, the junior says:

Hey guys! Check out this new website I found! You'll thank me later.
StackOverflow

Oh, Silly Clients...

By Xilo

Client: We need a phone number field on that form ASAP! We paid all this money and have hardly asked for anything. (completely not true btw...) You need to do this NOW!

Me: It has a phone number field...

Client: No it DOES NOT!

Me: Are we talking about the same form? *Sends a screenshot*

Client: Oh, I see it now. lol

Fml. Absolutely ready to give up client work. That was all exactly quoted btw.

Grateful - Thank God (TGI…)

Thus far, you’ve read a fair share of rants and moans on the darker end of the spectrum. But let’s be real – being a developer is pretty awesome. It’s certainly not all doom and gloom, especially if you’re a Toptal developer, who works whenever and wherever [s]he wants.
So to level the playing field, I’ve included the best grateful rants I could find.

Amen :p

By Jonsnowstorm

God bless the person who had the same tech issue as you, posted it, got the solution and came back to confirm in on a random forum.

Thank God For StackOverFlow and Google

By PythonRam

I am more thankful to StackOverFlow and Google than I am to my university.

Thank God For Remote Work

By Apieceoffruitt

I just had a nightmare.

I never became a developer. Instead I had a normal 9-5, didn't do work at home, slept well and spent my free time on social activities.

It was horrible.

Thank God for Smart Bosses

By Brod

We will no longer be accepting contracts which have an internet explorer or edge support requirement.

All of the front end devs are going hysterical and celebrating 😂 🎉🎉

Thank God for Good Clients

By Antho

Not a rant....

I received an email today from a client that reads "I'm so anxious to pay you. This is the best money I've spent on my website."

Thank God for Nice Coworkers

By Peamm

A young guy I work with burst into tears today, I had no idea what happened so I tried to comfort him and ask what was up.

It appears his main client had gone nuts with him because they wanted him to make an internet toolbar (think Ask.com) and he politely informed them toolbars doesn't really exist anymore and it wouldn't work on things like modern browsers or mobile devices.

Being given a polite but honest opinion was obviously something the client wasn't used to and knowing the guy was a young and fairly inexperienced, they started throwing very personal insults and asking him exactly what he knows about things (a lot more than them).

So being the big, bold, handsome senior developer I am, I immediately phoned the client back and told them to either come speak to me face-to-face and apologise to him in person or we'd terminate there contract with immediate effect. They're coming down tomorrow...

So part my rant, part a rant on behalf of a young developer who did nothing wrong and was treated like shit, I think we've all been there.

We'll see how this goes! Who the hell wants a toolbar anyway?!

Wisdom

We’ve complained, given thanks and shared some funny truths. To wrap things up, we’re leaving you with the best pearls of wisdom we found on devRant.

Always be efficient

By Avian

I knew I was about to get laid off so I stopped caring and started answering questions on stack overflow all day every day instead of working.

10k rep later I got a new job via stack overflow careers that pays twice as much.

Moral of the story? Be efficient... even when you are not.

Do As I Say, Not As I Do

By G-m-f

Never enter rude / swear words as test data.
At some point you'll forget they are in there until they show up in a client demo.

Reinvent The Damn Wheel

By SaDammie-Jr

Reinventing the wheel can be very valuable. Even if you don't create a better wheel, you'll learn a lot about how it works, which can really help you out in the long term.

[Serious] Reflection

By Edisonn

Worst part of being a dev: most employers want to own ya.

When I was at my first big employer, Microsoft, 10+ years ago, I was not even supposed to touch open source, let alone work on a side business.

Then I quit and joined another software giant, Google, that one was asking me to submit for approval on each open source project I would touch, and it was a 30 day period before I would get approval. And working on a side software business was an absolutely no-no, cause anything competes with them, or so they say.

At my current employer I am allowed to do whatever the hell I want. And they have only one, common sense, restriction: Whatever I do, should not be related to their core business.

I wish I would have not sold my soul when I had lots of time, no kids to take care off, and I was young and energetic.

It takes me now months to make a baby step for my wannabe business.

Permatemp

By Marps1

Nothing is more permanent than a temporary fix.

Now, go Forth, Young Jedi. May the Force be With You.

While these rants may be funny when they’re not happening to you, they’re almost certainly not so when they are.
So the next time someone gives you something to rant about, remember this post, and head over to devRant to get it off your chest.
This article was originally posted on Toptal