Help us (and others) reduce the cost of cloud-hosted database clusters

From: Ben Chobot <bench(at)instructure(dot)com>
To: pgsql-jobs(at)postgresql(dot)org
Subject: Help us (and others) reduce the cost of cloud-hosted database clusters
Date: 2020-12-18 23:56:14
Message-ID: d9c677f3-2b4a-a65f-f830-d367723dba91@instructure.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-jobs

Hello Postgres community,

Instructure is looking to commission a bit of open source software that
manages the cost of a database cluster in a cloud environment.

Specifically, we have hundreds of primary/secondary postgres clusters in
AWS' EC2, which are busy during the day and not so busy at night. We
would like to shrink the instance size of the secondary db once its load
has reliably fallen below a threshold and its load has been shifted to
the primary, and then undo all that before the morning rush. (We don't
want to stop it, because, while that would save even more money, we need
it to keep streaming.) The project is generally pretty easy, but there
are some wrinkles that make it not quiteas straightforward as it sounds:

1. We have lots of historical data to help us predict the time of day
that these thresholds should be crossed and the appropriate action
should be taken, but load patterns shift over time and the model will
need to be continually updated. Also, there is always the chance some
organic client load comes along which forces us to enlarge our secondary
DB earlier than we otherwise would. Determining how wide the spike of
unexpected client load would need to be to force such a premature
embiggening of the secondary is going to need some user-adjustable
knobs. Additionally, we run a lot of maintenance work at night (think
pg_repack) and we would not want to bother enlarging our secondary for
this kind of non-client load. So there will need to be some
user-adjustable filters to try to ignore this expected load when
determining if a premature embiggening should happen.

2. Occasionally we will have planned client workloads in our normally
idle times. In those cases, we'd want to be able to easily schedule this
system notto shrink anything for those windows. Sometimes we know about
these events weeks in advance, so we'll want to be able to easily manage
the list of upcoming exception windows, because if something is far
enough into the future, we will doubtlessly wonder several times if
we've remembered to take it into account.

3. When embiggening the secondary node back to its proper size, it might
that AWS EC2 has no more instances of the desired type available. There
are almost always multiple alternative instances sizes that would be
sufficient, if not as cheap, but the system will need to have the smarts
to consult a ranked table of choices when necessary.

4. Before shrinking a secondary db, we would want to alter a DNS cname
to shift load from it to the primary, and not proceed until client load
has been entirely migrated. When we inflate the secondary in the
morning, we would need to change the cname back, but only after
streaming replication has resumed. In our case, changing DNS is a simple
command line invocation.

5. Most of our postgres clusters have one secondary, but some have
multiple. There would need to be flexibility in how many secondaries
each cluster has, and the proper calculations made to know how much load
would be migrating to the primary come shrinkage time.

6. Long running transactions on the secondary will need to be addressed;
fundamentally, they'll have to be cancelled, but ideally different kinds
of queries might get different runways.
https://gitlab.com/depesz/pg_terminator/-/tree/master provides a good
possible framework.

Required Deliverables

We need software that does all of the above. It needs to run on Ubuntu
Linux, and be written in Ruby, bash, perl, Python, or Go (the languages
our team has the most knowledge in). We expect a unit test framework. We
will be hosting the project in Instructure's Github account, and will be
happy to accept patches from the public.

As an open-source project, it is easy to conceive of a system that would
do all this with a plugin architecture. Of course, the system we are
commissioning would need to support Instructure's needs, but could
easily be extended to also support AWS RDS databases, or MySQL
databases, or Azure, or GCP, etc. It could have callbacks for arbitrary
actions to take before shrinking or enlarging a secondary, or various
ways to compute times to take action, or a variety of ways to manage
exception windows to not take action, and so on. We are only interested
in paying for a system that works for us, but if a little abstraction
can be slipped in to make this system easily extensible for others, so
much the better.

Timeline

*

Now till Jan 8, 2021 - Solicit proposals

*

Jan 11-Jan 15 - Ask us clarifying questions

*

Jan 19 - Deadline for proposals

*

Jan 22 - Select a vendor and move forward

* At the end of the project, two weeks for us to validate everything
works as expected

Required Elements of Proposal

*

Sketch of architecture

*

Timeline

*

Samples of previous open source projects

*

Customer references

Evaluation Criteria of Proposal

*

Experience with Postgres

*

Experience with AWS

*

Experience with open source projects

* Elegance of proposed architecture

Browse pgsql-jobs by date

  From Date Subject
Next Message Aurelia Dimaggio 2021-01-04 16:58:36 Database Engineer (PostgreSQL) - Berlin, Germany
Previous Message pinker 2020-12-08 04:15:55 Re: PostgreSQL Tech Specialist at Microsoft. APAC, India or Australia