Data Engineering

With the rise of big data and data science, many engineering roles are being challenged and expanded. One new-age role is data engineering.

Originally, data engineering meant loading external data sources and designing databases: designing and developing pipelines to collect, manipulate, store, and analyze data.


It has since grown to support the volume and complexity of big data. Data engineering now encompasses a wide range of skills, from web crawling and data cleansing to distributed computing and data storage and retrieval.

For data engineers, data storage and retrieval is the critical component of the pipeline, together with how the data can be used and analyzed.


In recent times, many new and different data storage technologies have emerged. However, which one is best suited and has the most appropriate features for data engineering?

Most engineers are familiar with SQL databases, such as PostgreSQL, MSSQL, and MySQL, which are structured in relational data tables with row-oriented storage.


Given how ubiquitous these databases are, we won't discuss them today. Instead, we explore three types of alternative data stores that are growing in popularity and that have introduced different approaches to dealing with data.

Within the context of data engineering, these technologies are search engines, document stores, and columnar stores.


  • Search engines excel at text queries. Compared to text matching in SQL databases, such as LIKE, search engines offer richer query capabilities and better performance out of the box.
  • Document stores provide better data schema adaptability than traditional databases. By storing the data as individual document objects, often represented as JSON, they do not require a predefined schema.
  • Columnar stores specialize in single-column queries and value aggregations. SQL operations such as SUM and AVG are considerably faster in columnar stores, as data of the same column are stored closer together on the hard drive.

In this article, we explore all three technologies: Elasticsearch as a search engine, MongoDB as a document store, and Amazon Redshift as a columnar store.

By understanding alternative data storage, we can choose the most suitable one for each situation.


Storage for Data Engineering: Which is the Best?

For data engineers, the most important aspects of a data store are how it indexes, shards, and aggregates data.


To compare these technologies, we'll examine each of these three aspects in turn.


Each data indexing strategy improves certain queries while hindering others.


Knowing which queries are used most often can influence which data store to adopt.


Sharding, the method by which a database divides its data into chunks, determines how the infrastructure will grow as more data is ingested.


Choosing one that matches our growth plan and budget is critical.


Finally, each of these technologies aggregates its data very differently.


When we are dealing with gigabytes and terabytes of data, the wrong aggregation strategy can limit the types and performance of the reports we can generate.


As data engineers, we must consider all three aspects when evaluating different data stores.


Contenders

Search Engine: Elasticsearch

Elasticsearch quickly gained popularity among its peers for its scalability and ease of integration. Built on top of Lucene, it offers powerful, out-of-the-box text search and indexing functionality. Aside from the traditional search engine tasks of text search and exact-value queries, Elasticsearch also offers layered aggregation capabilities.

Document Store: MongoDB

At this point, MongoDB can be considered the go-to NoSQL database. Its ease of use and flexibility quickly earned it popularity. MongoDB supports rich and adaptable querying for digging into complex documents. Often-queried fields can be sped up through indexing, and when aggregating a large chunk of data, MongoDB offers a multi-stage pipeline.

Columnar Store: Amazon Redshift

Alongside the growth of NoSQL’s popularity, columnar databases have also gathered attention, especially for data analytics. By storing data in columns instead of the usual rows, aggregation operations can be executed directly from the disk, greatly increasing performance. A few years ago, Amazon rolled out its hosted service for a columnar store called Redshift.
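
To make the row-versus-column distinction concrete, here is a minimal in-memory sketch. It is not how Redshift lays out data on disk; the sample records and the function names are hypothetical and only illustrate why an aggregation over one column touches less data in a columnar layout.

```python
from array import array

# Hypothetical sample data: (user_id, country, purchase_amount).
rows = [
    (1, "US", 20.0),
    (2, "DE", 35.5),
    (3, "US", 12.5),
    (4, "JP", 50.0),
]

# Row-oriented layout: every record is stored together, so summing one
# field still walks through every complete record.
def sum_amount_row_store(records):
    return sum(record[2] for record in records)

# Column-oriented layout: each column is one contiguous sequence, so an
# aggregation reads only the column it needs.
columns = {
    "user_id": array("q", (r[0] for r in rows)),
    "country": [r[1] for r in rows],
    "amount": array("d", (r[2] for r in rows)),
}

def sum_amount_column_store(cols):
    return sum(cols["amount"])

def avg_amount_column_store(cols):
    amounts = cols["amount"]
    return sum(amounts) / len(amounts)
```

Both layouts return the same totals; the difference is how much unrelated data has to be read to compute them, which is what matters once the table no longer fits in memory.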

Indexing

Elasticsearch's Indexing Capability

In many ways, search engines are data stores that specialize in indexing texts.


While other data stores create indices based on the exact values of the field, search engines allow retrieval with only a fragment of the (usually text) field.


By default, this retrieval is done automatically for every field through analyzers.


An analyzer is a module that creates multiple index keys by evaluating the field values and breaking them down into smaller values.


For example, a basic analyzer might break “the quick brown fox jumped over the lazy dog” into words, such as “the,” “quick,” “brown,” “fox” and so on.

This method lets users find data by searching for fragments of it, with the results ranked by how many fragments each document matches.
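
The idea can be sketched with a toy inverted index. This is an illustration only: the analyzer below simply lowercases and splits on non-alphanumeric characters, and ranking by the number of matching tokens is a simplification, not the relevance scoring Elasticsearch actually applies. All function names are hypothetical.

```python
import re
from collections import Counter, defaultdict

def analyze(text):
    """Basic analyzer: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map every token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Rank documents by how many distinct query tokens they contain."""
    hits = Counter()
    for token in set(analyze(query)):
        for doc_id in index.get(token, ()):
            hits[doc_id] += 1
    # Sort by match count (descending), then by id for a stable order.
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))

docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "A quick brown rabbit",
    3: "The lazy cat",
}
index = build_index(docs)
results = search(index, "quick fox")  # [(1, 2), (2, 1)]
```

Document 1 matches both query tokens and document 2 only one, so document 1 ranks first even though neither document was indexed by its full text value.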


A more sophisticated analyzer could combine further tokenization and filtering techniques to build a comprehensive retrieval index.

MongoDB's Indexing Capability

As a generic data store, MongoDB has a lot of flexibility for indexing data.


Unlike Elasticsearch, it only indexes the _id field by default, and we need to create indices for the commonly queried fields manually.


Compared to Elasticsearch, MongoDB's text analyzer isn't as powerful. But it does provide a lot of flexibility in its index types, from compound and geospatial indices for optimal querying to TTL and sparse indices for storage reduction.

Redshift's Indexing Capability

Unlike Elasticsearch, MongoDB, or even traditional databases such as PostgreSQL, Amazon Redshift does not support indices.

Instead, it reduces its query time by maintaining a consistent sorting on the disk.


As users, we can configure an ordered set of columns as the table sort key. With the data sorted on the disk, Redshift can skip an entire block during retrieval if the block's value range falls outside the queried range, heavily boosting performance.
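
Block skipping on sorted data can be illustrated with a small sketch. The block size, the `Block` class, and the sample values are all hypothetical; the point is only that keeping a minimum and maximum per block lets a range scan ignore blocks that cannot contain a match.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Block:
    """A fixed-size chunk of one sorted column plus its min/max bounds."""
    values: List[int]

    @property
    def low(self) -> int:
        return self.values[0]   # values are sorted, so the first is the minimum

    @property
    def high(self) -> int:
        return self.values[-1]  # and the last is the maximum

def make_blocks(sorted_values, block_size):
    return [Block(sorted_values[i:i + block_size])
            for i in range(0, len(sorted_values), block_size)]

def range_scan(blocks, lo, hi):
    """Return matching values and how many blocks were actually read."""
    matches, blocks_read = [], 0
    for block in blocks:
        if block.high < lo or block.low > hi:
            continue  # the whole block is outside the range: skip it
        blocks_read += 1
        matches.extend(v for v in block.values if lo <= v <= hi)
    return matches, blocks_read

timestamps = sorted(range(100, 200))  # hypothetical sort-key column
blocks = make_blocks(timestamps, block_size=10)
matches, blocks_read = range_scan(blocks, 125, 134)
```

Here only two of the ten blocks are read. If the same values were stored unsorted, every block's range would likely overlap the query and nothing could be skipped, which is why the sort order has to be maintained.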

Sharding

Elasticsearch's Sharding Capability

Elasticsearch was built on top of Lucene to scale horizontally and be production ready.


Scaling is done by creating multiple Lucene instances (shards) and distributing them across multiple nodes (servers) within a cluster.


By default, each document is routed to its respective shard through its _id field.


During retrieval, the coordinating node (the node that receives the request) sends each shard a copy of the query before finally aggregating and ranking the shards' results for output.
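
The routing and scatter-gather flow can be sketched in a few lines. This is a single-process toy, not Elasticsearch's implementation: the shard count, the `Cluster` class, and the use of CRC32 as the hash are assumptions chosen to keep the example self-contained.

```python
import zlib

NUM_SHARDS = 3  # hypothetical shard count

def shard_for(doc_id: str) -> int:
    """Pick a shard from a stable hash of the document id."""
    return zlib.crc32(doc_id.encode("utf-8")) % NUM_SHARDS

class Cluster:
    def __init__(self):
        self.shards = [dict() for _ in range(NUM_SHARDS)]

    def index(self, doc_id, doc):
        self.shards[shard_for(doc_id)][doc_id] = doc

    def search(self, predicate, score, size=10):
        # Scatter: every shard runs the same query on its own documents.
        partials = [
            [(score(doc), doc_id) for doc_id, doc in shard.items() if predicate(doc)]
            for shard in self.shards
        ]
        # Gather: merge the partial results and rank them globally.
        merged = [hit for partial in partials for hit in partial]
        merged.sort(key=lambda hit: (-hit[0], hit[1]))
        return [doc_id for _, doc_id in merged[:size]]

cluster = Cluster()
for i in range(20):
    cluster.index(f"doc-{i}", {"followers": i})

top = cluster.search(lambda d: d["followers"] >= 15,
                     lambda d: d["followers"], size=3)
```

Because the shard is derived from the id alone, a lookup by id needs to visit only one shard, while a search has to visit all of them and merge the answers.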


MongoDB's Sharding Capability

Within a MongoDB cluster, there are three types of servers: router, config, and shard.


By scaling the router servers, the cluster can accept more requests, but the heavy lifting happens at the shard servers.


As with Elasticsearch, MongoDB routes each document to its shard by a key, in this case the shard key chosen for the collection (_id is a common choice). At query time, the config server tells the router which shards to query, and the router server then distributes the query and aggregates the results.

Redshift's Sharding Capability

An Amazon Redshift cluster consists of one leader node and one or more compute nodes.

The leader node handles the compilation and distribution of queries as well as the aggregation of intermediate results.


Unlike MongoDB's router servers, there is only ever one leader node, and it can't be scaled horizontally.


While this creates a bottleneck, it also allows efficient caching of compiled execution plans for popular queries.


Aggregating

Elasticsearch's Aggregating Capability

Documents within Elasticsearch can be bucketed by exact, ranged, or even temporal and geolocation values.


These buckets can be further grouped into finer granularity through nested aggregation.


Metrics, including means and standard deviations, can be calculated for each layer, which provides the ability to calculate a hierarchy of analyses within a single query.
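
A plain-Python sketch of the same idea, two levels of buckets with metrics computed at each layer, may help. It does not use the Elasticsearch query DSL; the documents, field names, and the `nested_aggregate` helper are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical documents.
docs = [
    {"country": "US", "device": "ios", "latency_ms": 100},
    {"country": "US", "device": "ios", "latency_ms": 140},
    {"country": "US", "device": "android", "latency_ms": 200},
    {"country": "DE", "device": "android", "latency_ms": 80},
]

def metrics(values):
    return {"count": len(values), "mean": mean(values), "std": pstdev(values)}

def nested_aggregate(documents, outer, inner, field):
    """Bucket by `outer`, then by `inner`, with metrics at both layers."""
    grouped = defaultdict(lambda: defaultdict(list))
    for doc in documents:
        grouped[doc[outer]][doc[inner]].append(doc[field])
    result = {}
    for outer_key, inner_groups in grouped.items():
        all_values = [v for values in inner_groups.values() for v in values]
        result[outer_key] = {
            **metrics(all_values),
            "buckets": {k: metrics(v) for k, v in inner_groups.items()},
        }
    return result

report = nested_aggregate(docs, "country", "device", "latency_ms")
```

The result holds per-country metrics and, nested inside each country, per-device metrics, which mirrors the hierarchy of analyses a single nested aggregation query returns.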


As a document-based store, Elasticsearch is limited when it comes to intra-document field comparisons.


For example, while it is good at filtering on whether a field followers is greater than 10, we cannot directly check whether followers is greater than another field, following.


As an alternative, we can inject scripts as custom predicates. This feature is great for one-off analysis, but performance suffers in production.

MongoDB's Aggregating Capability

MongoDB's Aggregation Pipeline is powerful and fast.


As its name suggests, it operates on returned data in a stage-wise fashion.


Each step can filter, aggregate and transform the documents, introduce new metrics, or unwind previously aggregated groups.


Because these operations are done in a stage-wise manner, and because each stage can reduce the documents and fields to only those that pass its filters, the memory cost can be minimized. Compared to Elasticsearch, and even Redshift, the Aggregation Pipeline is an extremely flexible way to view the data.

Despite its adaptability, MongoDB suffers the same lack of intra-document field comparison as Elasticsearch.


Furthermore, some operations, including $group, require the results to be passed to the master node.

Thus, they do not leverage the distributed computing.


Those unfamiliar with the stage-wise pipeline calculation will find certain tasks unintuitive. For example, summing up the number of elements in an array field would require two steps: first, the $unwind, and then the $group operation.
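
The two-step pattern can be emulated in plain Python to show what each stage contributes. This is not MongoDB code: `unwind`, `group_count`, and `run_pipeline` are hypothetical helpers that mimic the behavior of the corresponding stages on an in-memory list, and the `users` collection is made up.

```python
from collections import defaultdict

# Hypothetical collection: each user document holds an array of tags.
users = [
    {"_id": 1, "country": "US", "tags": ["a", "b", "c"]},
    {"_id": 2, "country": "US", "tags": ["a"]},
    {"_id": 3, "country": "DE", "tags": []},
    {"_id": 4, "country": "DE", "tags": ["b", "c"]},
]

def unwind(documents, field):
    """Stage 1: emit one document per array element (like $unwind)."""
    for doc in documents:
        for element in doc[field]:
            yield {**doc, field: element}

def group_count(documents, key):
    """Stage 2: count documents per key (like $group with a $sum of 1)."""
    counts = defaultdict(int)
    for doc in documents:
        counts[doc[key]] += 1
    return dict(counts)

def run_pipeline(documents, stages):
    """Feed the output of each stage into the next one."""
    for stage in stages:
        documents = stage(documents)
    return documents

tag_totals = run_pipeline(users, [
    lambda docs: unwind(docs, "tags"),
    lambda docs: group_count(docs, "country"),
])  # {"US": 4, "DE": 2}
```

In MongoDB itself, the equivalent pipeline would be roughly `[{"$unwind": "$tags"}, {"$group": {"_id": "$country", "total": {"$sum": 1}}}]`; the grouping only works because the first stage has already turned every array element into its own document.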

Redshift's Aggregating Capability

The benefits of Amazon Redshift cannot be overstated.

Frustratingly slow aggregations on MongoDB while analyzing mobile traffic are quickly solved by Amazon Redshift.

Because Redshift supports SQL, traditional database engineers will have an easy time migrating their queries to it.


Onboarding time aside, SQL is a proven, scalable, and powerful query language, supporting intra-document/row field comparisons with ease. Amazon Redshift further improves its performance by compiling and caching popular queries executed on the compute nodes.
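
A field-to-field comparison in SQL is a one-line predicate. The sketch below uses Python's built-in sqlite3 module as a stand-in so that it runs anywhere; it is not connected to Redshift, and the `users` table and its rows are hypothetical. The followers and following columns echo the earlier example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (name TEXT, followers INTEGER, following INTEGER)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [("ann", 120, 80), ("bob", 15, 300), ("cy", 50, 50)],
)

# Compare two columns of the same row directly in the WHERE clause.
popular = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM users WHERE followers > following ORDER BY name"
    )
]
conn.close()
```

Only the row whose followers value exceeds its following value is returned; no scripting or extra pipeline stage is needed to express the comparison.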

As a relational database, Amazon Redshift does not have the schema flexibility that MongoDB and Elasticsearch have. Optimized for read operations, it suffers performance hits during updates and deletes.

To maintain the best read time, the rows must be sorted, adding extra operational efforts.


Tailored to those with petabyte-sized problems, it is not cheap and likely not worth the investment unless there are scaling problems with other databases.


Picking the Winner

In this article, we examined three different technologies – Elasticsearch, MongoDB, and Amazon Redshift – within the context of data engineering. However, there is no clear winner as each of these technologies is a front-runner in its storage type category.

For data engineering, depending on the use case, some options are better than others.


  • MongoDB is a fantastic starter database. It provides the flexibility we want when the data schema is still to be determined. That said, MongoDB does not outperform other databases in the specific use cases they specialize in.
  • While Elasticsearch offers a similarly fluid schema to MongoDB, it is optimized for multiple indices and text queries at the expense of write performance and storage size. Thus, we should consider migrating to Elasticsearch when we find ourselves maintaining numerous indices in MongoDB.
  • Redshift requires a predefined data schema and lacks the adaptability that MongoDB provides. In return, it outclasses other databases for queries involving only a single column (or a few columns). When the budget permits, Amazon Redshift is a great secret weapon when others cannot handle the data quantity.

About the author

member since October 19, 2015





