Amazon’s super-duper data pipeline is now ready for its close-up

Customers interested in trying out Amazon’s spanking-new Data Pipeline can now sign up for the service. Last month, Amazon pitched the service as an easy way for customers to consolidate data from multiple repositories, both inside and outside of Amazon Web Services, and put it in one place where they can run big batches of analytics and reporting. It’s not a stretch to guess that Amazon hopes those customers will use its new Redshift data warehouse service for those analytics purposes. The Data Pipeline sign-up news was disclosed Friday on the Amazon Web Services blog.

As is usually the case with AWS, there is a free tier of usage available for those wanting to test the waters:

[Screenshot: AWS Data Pipeline free-tier pricing]

And then there’s a paid tier for production workloads:

[Screenshot: AWS Data Pipeline paid-tier pricing]

In announcing Data Pipeline plans at AWS re:Invent last month, Amazon CTO Werner Vogels painted it as a way to help customers create automated, scheduled workflows of data, from Amazon’s own DynamoDB database service or S3 storage to Elastic MapReduce or wherever else the data is needed (this is where Redshift comes in). He promised pre-integration with AWS data sources and “easy connection” to third-party and on-premises data sources as well. It’s not clear from the post what connectivity there is to those third-party data sources now, although copying an on-premises MySQL database does appear on the list of Data Pipeline templates.

[Screenshot: Data Pipeline templates and supported AWS data sources]
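For readers curious what those workflows look like in practice, a pipeline is defined as a set of objects (a schedule, data nodes, an activity, and a resource to run it on) wired together by references. Below is a minimal, hypothetical sketch using the boto3 Python SDK; the bucket names, IAM roles, dates, and schedule are all placeholder assumptions, not details from Amazon’s announcement.

```python
import boto3

# A minimal, hypothetical sketch of defining a scheduled pipeline with boto3.
# Bucket names, roles, and dates below are placeholders.
dp = boto3.client("datapipeline", region_name="us-east-1")

pipeline_id = dp.create_pipeline(
    name="daily-s3-copy",
    uniqueId="daily-s3-copy-001",  # idempotency token for safe retries
)["pipelineId"]

objects = [
    # Default: settings (role, schedule) inherited by every other object
    {"id": "Default", "name": "Default", "fields": [
        {"key": "scheduleType", "stringValue": "cron"},
        {"key": "schedule", "refValue": "DailySchedule"},
        {"key": "role", "stringValue": "DataPipelineDefaultRole"},
        {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
    ]},
    # Run the workflow once a day
    {"id": "DailySchedule", "name": "DailySchedule", "fields": [
        {"key": "type", "stringValue": "Schedule"},
        {"key": "period", "stringValue": "1 days"},
        {"key": "startDateTime", "stringValue": "2012-12-22T00:00:00"},
    ]},
    # Source and destination (placeholder S3 paths)
    {"id": "InputData", "name": "InputData", "fields": [
        {"key": "type", "stringValue": "S3DataNode"},
        {"key": "directoryPath", "stringValue": "s3://example-source/logs/"},
    ]},
    {"id": "OutputData", "name": "OutputData", "fields": [
        {"key": "type", "stringValue": "S3DataNode"},
        {"key": "directoryPath", "stringValue": "s3://example-dest/logs/"},
    ]},
    # A short-lived EC2 instance to do the work
    {"id": "CopyResource", "name": "CopyResource", "fields": [
        {"key": "type", "stringValue": "Ec2Resource"},
        {"key": "terminateAfter", "stringValue": "1 hours"},
    ]},
    # The copy step itself, wired to the nodes and resource above by reference
    {"id": "CopyStep", "name": "CopyStep", "fields": [
        {"key": "type", "stringValue": "CopyActivity"},
        {"key": "input", "refValue": "InputData"},
        {"key": "output", "refValue": "OutputData"},
        {"key": "runsOn", "refValue": "CopyResource"},
    ]},
]

dp.put_pipeline_definition(pipelineId=pipeline_id, pipelineObjects=objects)
dp.activate_pipeline(pipelineId=pipeline_id)
```

The same definition could just as easily point its output at an Elastic MapReduce or Redshift target, which is the combination Amazon is clearly angling for.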

Also new from AWS: Fat new instance types

On Friday, Amazon also announced a new “high storage EC2 instance family” tailored for data-intensive jobs that need lots of storage density and fast sequential I/O. Such applications include data warehousing (hello again, Redshift) and log processing.

According to the blog post announcing the new family:

“High Storage Eight Extra Large (hs1.8xlarge) instances are a great fit for applications that require high storage density and high sequential I/O performance. Each instance includes 120 GiB of RAM, 16 virtual cores (providing 35 ECU of compute performance), and 48 TB of instance storage across 24 hard disk drives capable of delivering up to 2.4 GB per second of I/O performance.”

The new instances are available immediately in AWS’s US East region and will roll out to other regions later. Pricing for on-demand instances starts at $4.60 per hour, but users can also buy one- and three-year reserved instances, with prices listed on the EC2 pricing page.
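For the sake of illustration, here is a hypothetical sketch of requesting one of the new instances via the boto3 Python SDK, with the back-of-the-envelope monthly math at the quoted on-demand rate; the AMI ID is a placeholder, not a real image.

```python
import boto3

# Hypothetical sketch: launch a single hs1.8xlarge on-demand instance.
ec2 = boto3.client("ec2", region_name="us-east-1")  # US East, where the family launches first

reservation = ec2.run_instances(
    ImageId="ami-12345678",      # placeholder AMI
    InstanceType="hs1.8xlarge",  # 120 GiB RAM, 16 virtual cores, 24 x 2 TB local drives
    MinCount=1,
    MaxCount=1,
)
print(reservation["Instances"][0]["InstanceId"])

# Back-of-the-envelope on-demand cost at the quoted rate:
# $4.60/hour x 24 hours x 30 days ≈ $3,312 per month per instance.
```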

Both the Data Pipeline and the fat new instances show that, as customer applications continue to generate tons of data, both relational and non-relational, Amazon is determined to attack some of the biggest and toughest big data applications around. A combination of the Data Pipeline and Redshift, if it works as advertised, could mean serious problems for big, pricey data warehouse solutions from Teradata, Oracle, and Hewlett-Packard/Vertica.

