Cloud Dynamo

Standardize data engineering pipelines across cloud platforms with a simple unified API.

03/08/2023

AWS Lake Formation 2022 year in review

Data governance is the collection of policies, processes, and systems that organizations use to ensure the quality and appropriate handling of their data throughout its lifecycle for the purpose of generating business value. Data governance is increasingly top-of-mind for customers as they recognize data as one of their most important assets. Effective data governance enables better decision-making by improving data quality, reducing data management costs, and ensuring secure access to data for stakeholders. In addition, data governance is required to comply with an increasingly complex regulatory environment with data privacy (such as GDPR and CCPA) and data residency regulations (such as in the EU, Russia, and China).

For AWS customers, effective data governance improves decision-making, increases business agility, provides a competitive advantage, and reduces the risk of fines due to non-compliance with regulatory obligations. We understand the unique opportunity to provide our customers a comprehensive end-to-end data governance solution that is seamlessly integrated into our portfolio of services, and AWS Lake Formation and the AWS Glue Data Catalog are key to solving these challenges.

https://aws.amazon.com/blogs/big-data/aws-lake-formation-2022-year-in-review/

03/06/2023

Build a real-time GDPR-aligned Apache Iceberg data lake

Data lakes are a popular choice for today’s organizations to store data about their business activities. As a best practice of data lake design, data should be immutable once stored. But regulations such as the General Data Protection Regulation (GDPR) have created obligations for data operators, who must be able to erase or update personal data in their data lake when requested.

A data lake built on AWS uses Amazon Simple Storage Service (Amazon S3) as its primary storage environment. When a customer asks to erase or update private data, the data lake operator needs to find the S3 objects that contain that data and then erase or update it. This activity can be a complex process for the following reasons:

* Data lakes may contain many S3 objects, each holding multiple rows, so it’s often difficult to find the objects that contain the exact data to be erased or the personally identifiable information (PII) to be updated

* S3 objects are immutable by nature, so applying direct row-based transactions like DELETE or UPDATE isn’t possible

To handle these situations, a transactional capability on top of S3 objects is required. Frameworks such as Apache Hudi or Apache Iceberg provide that capability, including upserts, for data stored in Amazon S3.
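To make the idea concrete, here is a minimal Python sketch of the copy-on-write pattern that a row-level delete over immutable objects can use. It is a toy in-memory model, not the Apache Iceberg or Apache Hudi API: `ToyTable`, `delete_where`, `purge_unreferenced`, the `data/file-NNNNN` keys, and the sample rows are all hypothetical names and data invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Row = Dict[str, str]


@dataclass
class ToyTable:
    """Toy stand-in for a table format's metadata layer (hypothetical, not a real API)."""

    # Immutable "objects": once written under a key, the rows are never edited.
    objects: Dict[str, Tuple[Row, ...]] = field(default_factory=dict)
    # The current table version: the list of object keys readers should scan.
    snapshot: List[str] = field(default_factory=list)
    counter: int = 0

    def _new_key(self) -> str:
        self.counter += 1
        return f"data/file-{self.counter:05d}"

    def append(self, rows: List[Row]) -> str:
        """Write rows as a new object and publish a snapshot that includes it."""
        key = self._new_key()
        self.objects[key] = tuple(rows)
        self.snapshot = self.snapshot + [key]
        return key

    def delete_where(self, column: str, value: str) -> int:
        """Copy-on-write delete: rewrite only the objects that contain matches."""
        new_snapshot: List[str] = []
        removed = 0
        for key in self.snapshot:
            rows = self.objects[key]
            kept = tuple(r for r in rows if r.get(column) != value)
            if len(kept) == len(rows):
                new_snapshot.append(key)  # untouched object is reused as-is
                continue
            removed += len(rows) - len(kept)
            if kept:
                new_key = self._new_key()
                self.objects[new_key] = kept  # rewritten copy without the row
                new_snapshot.append(new_key)
        self.snapshot = new_snapshot  # single swap publishes the new version
        return removed

    def purge_unreferenced(self) -> List[str]:
        """Physically drop objects that no longer belong to the current snapshot."""
        stale = [k for k in self.objects if k not in self.snapshot]
        for key in stale:
            del self.objects[key]
        return stale

    def scan(self) -> List[Row]:
        return [row for key in self.snapshot for row in self.objects[key]]


table = ToyTable()
table.append([{"id": "1", "email": "a@example.com"},
              {"id": "2", "email": "b@example.com"}])
table.append([{"id": "3", "email": "c@example.com"}])

removed = table.delete_where("id", "2")
remaining_ids = [row["id"] for row in table.scan()]
purged = table.purge_unreferenced()
```

In this sketch the new snapshot only hides the deleted row from readers; the superseded object still exists until `purge_unreferenced` runs, which is why an erasure request typically also needs a cleanup step for old files.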

https://aws.amazon.com/blogs/big-data/build-a-real-time-gdpr-aligned-apache-iceberg-data-lake/

03/03/2023

Patterns for enterprise data sharing at scale

Data sharing is becoming an important element of an enterprise data strategy.

AWS services like AWS Data Exchange provide an avenue for companies to share or monetize their value-added data with other companies.

Some organizations would like to have a data sharing platform where they can establish a collaborative and strategic approach to exchanging data with a restricted group of companies in a closed, secure, and exclusive environment. Examples include financial services companies and their auditors, or manufacturing companies and their supply chain partners. This fosters development of new products and services and helps improve their operational efficiency.

https://aws.amazon.com/blogs/big-data/patterns-for-enterprise-data-sharing-at-scale/

02/17/2023

Use fuzzy string matching to approximate duplicate records in Amazon Redshift

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. Amazon Redshift enables you to run complex SQL analytics at scale and performance on terabytes to petabytes of structured and unstructured data, and make the insights widely available through popular business intelligence (BI) and analytics tools.

It’s common to ingest multiple data sources into Amazon Redshift to perform analytics. Often, each data source will have its own processes of creating and maintaining data, which can lead to data quality challenges within and across sources.

One challenge you may face when performing analytics is the presence of imperfect duplicate records within the source data. Answering questions as simple as “How many unique customers do we have?” can be very challenging when the data you have available is like the following table.
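As a rough illustration of what fuzzy matching on imperfect duplicates can look like, the following Python sketch scores pairs of names by normalized Levenshtein edit distance. This is not the method from the linked article, whose details are not shown in this excerpt; the function names, the 0.8 threshold, and the sample names are assumptions made up for the example.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum single-character inserts, deletes, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Score in [0, 1]; 1.0 means identical after trimming and lowercasing."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def likely_duplicates(names: List[str], threshold: float = 0.8) -> List[Tuple[str, str, float]]:
    """Compare every pair and keep those at or above the (assumed) threshold."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            score = similarity(names[i], names[j])
            if score >= threshold:
                pairs.append((names[i], names[j], round(score, 2)))
    return pairs


candidates = likely_duplicates(["Jon Smith", "John Smith", "Mary Jones"])
```

The pairwise loop is quadratic in the number of records, so a sketch like this only suits small batches; at warehouse scale some form of blocking or pre-grouping would be needed before scoring.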

https://aws.amazon.com/blogs/big-data/use-fuzzy-string-matching-to-approximate-duplicate-records-in-amazon-redshift/

02/16/2023

Scaling rapidly with AWS—How SEON achieved 3x growth for 3 years running

Scaling a startup successfully involves increasing profit margins exponentially while keeping costs low. Most startups combine a variety of approaches to scale, based on their growth stage and needs. Techniques to scale include finding processes that work and applying them across the board, focusing on customers and building a product that is in high demand, and harnessing AWS cloud technology to move fast and optimize your costs.

SEON, a Hungarian fraud prevention startup founded by Tamás Kádár and Bence Jendruszák in 2017, is a model of successful startup scaling: Without major refactors of their architecture, SEON has scaled rapidly for three consecutive years, achieving triple growth each year by building on cloud services offered by AWS. In 2021 alone, SEON more than tripled its annual recurring revenue, grew its headcount by 4X, and opened new offices in Austin, Texas and Jakarta, Indonesia.

https://aws.amazon.com/blogs/startups/scaling-rapidly-with-aws-how-seon-achieved-3x-growth-for-3-years-running/

02/15/2023

Build a real-time fraud detection solution using Amazon Neptune ML

Each year online businesses lose tens of billions of dollars due to fraud, which can take many forms. For example, fraudsters can obtain stolen credit card details and use them for unauthorized transactions.

Therefore, detecting fraud and malicious behavior at the time of a transaction, such as when a user registers a new payment method, is necessary for working to prevent these fraud-related losses. The following diagram shows a fraud detection use case where a business predicts if a purchase request using a credit card is fraudulent or not based on data on known fraud.

https://aws.amazon.com/blogs/database/build-a-real-time-fraud-detection-solution-using-amazon-neptune-ml/

02/15/2023

Is Big Data Dead?

For more than a decade now, the fact that people have a hard time gaining actionable insights from their data has been blamed on its size. “Your data is too big for your puny systems,” was the diagnosis, and the cure was to buy some new fancy technology that can handle massive scale. Of course, after the Big Data task force purchased all new tooling and migrated from Legacy systems, people found that they still were having trouble making sense of their data. They also may have noticed, if they were really paying attention, that data size wasn’t really the problem at all.

The world in 2023 looks different from when the Big Data alarm bells started going off. The data cataclysm that had been predicted hasn’t come to pass. Data sizes may have gotten marginally larger, but hardware has gotten bigger at an even faster rate. Vendors are still pushing their ability to scale, but practitioners are starting to wonder how any of that relates to their real world problems.

https://motherduck.com/blog/big-data-is-dead

Big data is dead. Long live easy data.

02/14/2023

Building serverless on AWS to scale Ramp’s fast-growing finance automation platform

For startups, coming full circle is a milestone defined by partnering with the programs used during early stage growth, or providing resources that help other startups succeed as well.

Ramp, a B2B fintech startup founded in 2019 by veteran founders Eric Glyman and Karim Atiyeh, does both. Ramp is a tech-first finance automation platform whose serverless modern application, in conjunction with its corporate card, allows businesses to manage their finances more efficiently.

https://aws.amazon.com/blogs/startups/building-serverless-on-aws-to-scale-ramps-fast-growing-finance-automation-platform/

02/13/2023

Automate schema evolution at scale with Apache Hudi in AWS Glue

In the data analytics space, organizations often deal with many tables in different databases and file formats to hold data for different business functions. Business needs often drive changes to table structure, known as schema evolution (the addition of new columns, removal of existing columns, update of column names, and so on). When such changes are made to tables in one business function, other business functions often need to replicate them.

This post focuses on such schema changes in file-based tables and shows how to automatically replicate the schema evolution of structured data from tables in databases to tables stored as files in a cost-effective way.
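The first step in replicating schema evolution is detecting what changed. The Python sketch below compares a source schema with a target schema and reports added, removed, and retyped columns. It is a generic illustration, not the Apache Hudi or AWS Glue mechanism described in the linked post; `diff_schema`, the column names, and the type strings are hypothetical.

```python
from typing import Dict, List, Tuple

Schema = Dict[str, str]  # column name -> type name


def diff_schema(source: Schema, target: Schema) -> Dict[str, object]:
    """Report what must change in `target` for it to match `source`."""
    # Columns present in the source but missing from the target.
    added: Schema = {c: t for c, t in source.items() if c not in target}
    # Columns still in the target that the source no longer has.
    removed: List[str] = [c for c in target if c not in source]
    # Columns in both whose type differs, as (target type, source type).
    retyped: Dict[str, Tuple[str, str]] = {
        c: (target[c], source[c])
        for c in source
        if c in target and source[c] != target[c]
    }
    return {"added": added, "removed": removed, "retyped": retyped}


source_schema = {"id": "bigint", "email": "string", "signup_ts": "timestamp"}
target_schema = {"id": "int", "email": "string", "phone": "string"}

changes = diff_schema(source_schema, target_schema)
```

A name-based comparison like this cannot recognize a renamed column; a rename shows up as one removal plus one addition, so handling renames would need extra information such as stable column identifiers.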

https://aws.amazon.com/blogs/big-data/automate-schema-evolution-at-scale-with-apache-hudi-in-aws-glue/

02/10/2023

FinOps: Four Ways to Reduce Your BigQuery Storage Cost

Don’t overlook the cloud storage cost

With the current state of the economic situation, it’s more important than ever to maximize our cash on hand and develop a series of cost optimization strategies. The growing use of cloud services has brought not only many opportunities for the business but also the potential for management challenges that can lead to cost overruns and other issues.

https://towardsdatascience.com/finops-four-ways-to-reduce-your-bigquery-storage-cost-82d99c47f139

02/10/2023

Introducing native support for Apache Hudi, Delta Lake, and Apache Iceberg on AWS Glue for Apache Spark, Part 1: Getting Started

AWS Glue is a serverless, scalable data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources. AWS Glue provides an extensible architecture that enables users with different data processing use cases.

A common use case is building data lakes on Amazon Simple Storage Service (Amazon S3) using AWS Glue extract, transform, and load (ETL) jobs. Data lakes free you from proprietary data formats defined by the business intelligence (BI) tools and limited capacity of proprietary storage. In addition, data lakes help you break down data silos to maximize end-to-end data insights. As data lakes have grown in size and matured in usage, a significant amount of effort can be spent keeping the data up to date by ensuring files are updated in a transactionally consistent manner.

https://aws.amazon.com/blogs/big-data/part-1-getting-started-introducing-native-support-for-apache-hudi-delta-lake-and-apache-iceberg-on-aws-glue-for-apache-spark/


Address

Dallas, TX
