Backup and Recovery
The Remote Sampler system has been designed to provide high levels of availability and fault-tolerance. In the event of an unexpected disaster scenario, it is important that system availability is restored quickly and that data loss is minimised. This page outlines the measures that are taken to ensure the safety of data stored on hosted cloud implementations of Remote Sampler along with the expectations around both Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
Each Remote Sampler environment is made up of a stack of cloud infrastructure with components that have different requirements around both availability and recovery. The following information relates specifically to production instances of the Remote Sampler system.
Cloud Components
Application Server
The application server runs two externally-facing RESTful APIs. The first API is used by the mobile devices in the field and by the web client; the second is used by the adaptor service (installed locally on customer infrastructure). The application server also runs the main Remote Sampler processing service.
The application server has been designed to be transient in nature. No customer data are stored on the application server. The only data that are persisted on the application server are log files for each of the APIs and services described above.
The application server is part of an AWS Auto Scaling group. In the event of a disaster scenario, the group is configured to automatically spin up a new instance in a different AWS availability zone.
Please note that the automated recovery functionality is cross-availability-zone and not cross-region. For more information on availability zones and regions, please refer to the AWS documentation.
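As an illustration of the cross-availability-zone arrangement described above, the following is a minimal sketch of how a request for such a group might be assembled for boto3's `create_auto_scaling_group` call. It is not the actual Remote Sampler configuration: the function name, group name, launch template ID, subnet IDs and the single-instance sizing are all assumptions made for the example.

```python
from typing import Any, Dict, Mapping


def build_asg_request(
    group_name: str,
    launch_template_id: str,
    subnets_by_az: Mapping[str, str],
) -> Dict[str, Any]:
    """Build keyword arguments for boto3's create_auto_scaling_group.

    subnets_by_az maps an availability zone name to a subnet ID in that
    zone (hypothetical values). All zones must be in the same region,
    because an Auto Scaling group cannot span regions.
    """
    if len(subnets_by_az) < 2:
        raise ValueError("need subnets in at least two availability zones")
    return {
        "AutoScalingGroupName": group_name,
        "LaunchTemplate": {
            "LaunchTemplateId": launch_template_id,
            "Version": "$Latest",
        },
        # Assumption: a single application server instance is kept running.
        "MinSize": 1,
        "MaxSize": 1,
        "DesiredCapacity": 1,
        # Listing subnets from several zones lets the group place a
        # replacement instance in another zone if one zone fails.
        "VPCZoneIdentifier": ",".join(
            subnets_by_az[az] for az in sorted(subnets_by_az)
        ),
        "HealthCheckType": "EC2",
        "HealthCheckGracePeriod": 300,  # seconds allowed for the instance to boot
    }
```

The resulting dictionary could be passed as `boto3.client("autoscaling").create_auto_scaling_group(**request)`. Because every subnet must belong to the same region, this style of recovery is limited to availability zones, which matches the note above.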
During this process, the system will be unavailable for data transfer activities. Users with work already on their mobile devices can continue to complete work as normal, and any completed work will be queued and transferred as soon as service is resumed. It will not be possible to assign new work to the field users until the issue is resolved. It will not be possible to send data back to external systems (e.g. LIMS) until the issue is resolved.
If a new application server has to be spun up, the system outage will usually last approximately 15-20 minutes. This is typically too short for end users to notice an interruption in service.
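The queueing behaviour described above follows a store-and-forward pattern. The following is a minimal sketch of that pattern only, not the Remote Sampler mobile client: the `OutboundQueue` class and the `send` callback are hypothetical names introduced for illustration.

```python
from collections import deque
from typing import Callable, Deque, Dict


class OutboundQueue:
    """Hypothetical store-and-forward queue for completed work items."""

    def __init__(self, send: Callable[[Dict], None]) -> None:
        # send() is assumed to raise ConnectionError while the service is down.
        self._send = send
        self._pending: Deque[Dict] = deque()

    def complete(self, item: Dict) -> None:
        """Record a completed work item and attempt to transfer it."""
        self._pending.append(item)
        self.flush()

    def flush(self) -> int:
        """Send queued items in order; stop at the first failure.

        Returns the number of items transferred in this attempt.
        """
        sent = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break  # service unavailable: keep the item queued for later
            self._pending.popleft()
            sent += 1
        return sent

    def __len__(self) -> int:
        return len(self._pending)
```

An item is removed from the queue only after `send` returns without error, so an outage part-way through a transfer leaves the remaining work queued in its original order.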
Log data on the old application server will be lost but no customer data will be affected.
Database Server
The live Remote Sampler ecosystem has two Amazon RDS for SQL Server database servers operating in different AWS regions (eu-west-1 Ireland and eu-west-2 London). Each customer Remote Sampler system has a primary database on one of the two RDS servers.
Each database server is backed up in its entirety daily at 01:50 GMT. The daily backups are held in duplicate, with one copy in each region.
In the event of a disaster scenario that amounts to the loss of a database server in one of the regions, each affected customer database will be restored from the latest database snapshot to the operational database server in the other region. In this case, a system outage of 1-2 hours may occur depending on the size of the database.
During this process, the system will be unavailable for data transfer activities. Users with work already on their mobile devices can continue to complete work as normal, and any completed work will be queued and transferred as soon as service is resumed. It will not be possible to assign new work to the field users until the issue is resolved. It will not be possible to send data back to external systems (e.g. LIMS) until the issue is resolved.
Data loss will be limited to one operational day of sample data in the worst-case scenario. In the majority of cases, data loss will be considerably less.
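The one-day worst case follows directly from the daily backup schedule. The sketch below works through the arithmetic under two assumptions that are not stated on this page: recovery relies solely on the 01:50 GMT daily backup, and that backup reflects the database state at exactly 01:50. The function names are illustrative.

```python
from datetime import datetime, time, timedelta, timezone

# Daily backup time taken from the text above (GMT treated as UTC).
BACKUP_TIME = time(1, 50, tzinfo=timezone.utc)


def last_backup_before(failure: datetime) -> datetime:
    """Return the most recent daily backup at or before the failure.

    failure must be a timezone-aware datetime.
    """
    failure = failure.astimezone(timezone.utc)
    candidate = datetime.combine(failure.date(), BACKUP_TIME)
    if candidate > failure:
        # Today's backup has not run yet, so fall back to yesterday's.
        candidate -= timedelta(days=1)
    return candidate


def data_loss_window(failure: datetime) -> timedelta:
    """Time span of data that a restore from the daily backup would lose."""
    failure = failure.astimezone(timezone.utc)
    return failure - last_backup_before(failure)
```

Under these assumptions, a failure at 01:49 GMT loses 23 hours 59 minutes of data (the worst case), whereas a failure at 02:00 GMT loses only 10 minutes.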
Local Components
Adaptor Service
The adaptor service is a bi-directional communications service that is installed on customer infrastructure. It allows data to flow between a cloud Remote Sampler system and a LIMS (or other data system). The adaptor service does not persist any customer data and only saves log files to the local machine. Loss of the server running the adaptor service will interrupt communications between Remote Sampler and the external data system and will cause the loss of the local log files. No customer data will be affected.
Recovery of the adaptor service is constrained by the processes in place within the customer organisation. If the server is backed up and can be restored in its entirety, the adaptor service will resume when the server is brought back online. It will not be possible to assign new work to the field users until the issue is resolved. It will not be possible to send data back to external systems (e.g. LIMS) until the issue is resolved.
Recovery Time Objective and Recovery Point Objective
To summarise the above, the table below outlines the RTO and RPO for each of the components of a live Remote Sampler system.
| Component | RTO | RPO |
|---|---|---|
| Application Server | 20 minutes | 0 hours |
| Database | 2 hours | Minimum 5 minutes, maximum 24 hours |
| Adaptor Service | Dependent on customer | 0 hours |
Please note that in a recovery situation, these activities will be performed in parallel, independently of one another.
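Because the recovery activities run in parallel, the expected time to restore a system in which several components have failed is governed by the slowest component rather than by the sum of the individual figures. The following minimal sketch illustrates that arithmetic using the RTO values from the table; the `combined_rto` helper is a hypothetical name, and the adaptor service is left out because its recovery time depends on the customer.

```python
from datetime import timedelta
from typing import Iterable

# RTO figures taken from the table above. The adaptor service is omitted
# because its recovery time depends on the customer's own processes.
RTO = {
    "Application Server": timedelta(minutes=20),
    "Database": timedelta(hours=2),
}


def combined_rto(affected: Iterable[str]) -> timedelta:
    """Return the overall RTO when the affected components recover in parallel."""
    return max((RTO[name] for name in affected), default=timedelta(0))
```

For example, if both the application server and the database are lost, `combined_rto(["Application Server", "Database"])` gives 2 hours rather than 2 hours 20 minutes.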