Ideas and best practices for building enterprise multi-active disaster recovery systems in the cloud-native era



When people explain cloud native, you often hear about microservices and containers. So what is the relationship between these technologies and enterprise disaster recovery? In fact, the need for disaster recovery exists in every industry; the financial industry, for example, has a strong demand for it. But how to build disaster recovery and multi-active capability is something every company needs to think through. This talk aims to offer some relevant ideas.

Evolution of disaster recovery system functions

Today's topic is multi-active, which is really one part of a disaster recovery system, so let us first look at how the overall disaster recovery architecture has evolved:

Disaster Recovery 1.0: When application systems were first built, business systems were deployed in a data center on a traditional architecture. What emergency measures or troubleshooting methods were available? In this period only data backup was considered, mainly as cold backup. Besides the data center that served traffic, a second data center might be set aside for disaster scenarios: data was synchronized to it as a cold backup, and the business would switch over when a problem occurred. In practice, however, switching data centers is rarely the chosen option. Even in the financial industry, which runs disaster recovery drills every year, teams are afraid to switch when the production system actually has a problem.

Disaster Recovery 2.0: More attention is paid to applications. Whether for cloud-native applications or for upper-layer applications on a traditional IOE stack, a switchover should not just cut over and load the cold standby data; the application should also be brought up quickly in the other data center. To replicate at the data layer without too much delay, active-active is usually required. Active-active generally comes with constraints of its own, such as a distance limit that restricts it to data centers in the same city. Active-active is more often applied in the AQ model, in which the production side handles the full business and the other data center handles other business.

Disaster Recovery 3.0: The goal is multi-active across regions. What does "multi" mean? It means no longer being limited to two data centers but having three or more. Alibaba's business, for example, is distributed across multiple data centers, and serving external traffic from all of them at the same time requires corresponding technical support. Multi-active across regions also means not being limited by distance, such as 200 kilometers or a single city, because today's data centers are deployed all over the country.

Overview of business continuity and disaster recovery

Business continuity has a systematic methodology behind it, made up of the specifications and guidance accumulated over years of building disaster recovery systems. It has several dimensions:

1. B stands for business. Multi-active is not the same as traditional disaster recovery, which brings up an identical copy of the business in another data center; instead it selects the businesses that are worth protecting. In building a disaster recovery system, making every business multi-active is very difficult in terms of both cost and technology.

2. C stands for continuity. The core business must keep running in real time and must not stop serving because of events such as a power outage in a data center.

3. M stands for the management system that guarantees this. Every industry may have its own methods and management practices; what Alibaba provides is a way to turn them into technologies, tools, and products, so that everyone can build multi-active business capability quickly on top of this set of methods and products.

The BCM system and IT disaster recovery capabilities form a practical guiding framework. Business continuity sits at the top as the goal, and below it are the various ways of achieving it. At the bottom are items such as the IT plan and the plan for handling failures that threaten business continuity. These were all considered in traditional disaster recovery work too, but we have built them into the product system from the perspective of how they are actually carried out.

The disaster recovery methods mentioned here are relatively common: cold standby, active-active in the same city, and active-active in the same city plus cold standby in another region (two locations, three centers). These are fairly standardized in the industry. Multi-active across regions is like having all three data centers of a two-location, three-center setup serve traffic at the same time. It builds on the earlier models but differs from traditional disaster recovery in some respects, including construction cost: building multi-active capability across regions requires more investment than the traditional models (such as same-city active-active or two locations, three centers).

When building multi-active capability, the actual situation of the business is also taken into account. Requirements differ by industry, and in some cases only reads need to be served from both sides. Construction cost and switchover time vary accordingly. On the time axis, multi-active across regions can switch in minutes, whereas a cold standby may take days.

Why Alibaba does multi-active

Under Alibaba's business model, the reasons for doing multi-active are similar to those mentioned earlier. Without multi-active, a second data center has to be built. The cost is very high, because that data center is used only for routine data synchronization and carries no live traffic, and in the meantime the disaster recovery system has to be updated continuously to keep its version in step with the production system. In reality, when a cold standby or a two-location, three-center setup does switch over after a failure, it is very likely that the business cannot be switched back afterwards.

There are three main demands behind doing multi-active:

1. Resources. With today's rapid business growth, the resource capacity of a single site is limited. Cloud native and cloud computing provide high availability and disaster recovery capabilities, but cloud computing is deployed across different data centers, and multi-active across regions needs support from the underlying infrastructure. We want the business to scale without being limited by a single physical data center, and we want multiple data centers to serve traffic at the same time;

2. Business. A wide range of business needs call for local or remote deployment;

3. Disaster recovery events. A cut optical cable, or power and cooling problems caused by weather, can take down a single data center. Today's demand is not limited to one data center; multiple data centers are deployed across the country in different forms and can be adjusted flexibly according to the business model.

Because these demands make multi-active capability urgent, Alibaba built its multi-active solutions and products on its own business needs and technical capabilities.

Breaking down the multi-active architecture


1. Remote mutual backup: Everyone talks about how good cloud native and cloud computing are, but without multi-active capability these technologies effectively sit idle in the standby site. A cold standby does no work, and the decision about when to cut over to it mostly depends on people. Because a switchover has a large impact on the business, it has to be reported up layer by layer. More mature customers have plans that specify which impacts and failures warrant a switch, but in practice they generally do not dare to switch in cold standby mode.

2. Active-active in the same city: There is a distance limit. In the common active-active mode, the upper application layer, such as the cloud-native PaaS layer, can be distributed across both data centers. At the data layer, a same-city setup can rely on storage-level replication, with the database cut over to the standby data center when the primary data center has a problem. The advantage is that the machines and resources in both data centers are active. And because both data centers are active, there is no need to worry about version differences between production and standby, so teams are not afraid to cut over.

3. Two locations, three centers: In addition to same-city active-active, which already improves the ability to handle failures, a cold standby data center is built in another region. This is similar to the first, cold standby solution: the cold standby data center is normally not used, apart from some data synchronization, and is switched to only when a failure occurs.

4. Multi-active across regions: Multiple data centers serve external traffic at the same time. Because of distance, replication at the data layer is constrained by the network, and latency will always exist. Many technical problems must be solved, such as how to switch quickly from the Beijing data center to Shanghai, and how to cut over when the underlying data is not fully synchronized because of physical constraints. Our operating model is not a plain switchover as in traditional disaster recovery; it involves a lot of preparation beforehand and a data compensation process afterwards. We have integrated all of this into the product system, and where physical limits cannot be broken, we optimize through the architecture.

Progressive multi-active disaster recovery architecture

For key core businesses, a multi-active system or project starts with sorting out the business. What I am describing here is sorting it into units (unitization).

With dual-read or two locations, three centers, traffic is normally split at most half and half between two data centers, which is the simplest case. This model shows how business sharding rules work: for example, the business can be split in half by user ID. In multi-active, we want the split to be flexibly configurable, so that depending on each data center's processing capacity and the nature of a fault, traffic can be adjusted to 50%, 60%, or any other proportion. The same applies with multiple data centers, where incoming traffic can be distributed in a unified way.
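
The proportional split described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not how the product implements it: the unit names, the 60/40 weights, and the functions `bucket_of` and `route` are all hypothetical. Each user ID is hashed to a stable bucket, and the bucket is mapped to a unit according to configurable weights.

```python
import hashlib
from typing import Mapping

# Hypothetical routing table: unit name -> traffic weight.
# The names and the 60/40 split are illustrative only.
WEIGHTS = {"unit-a": 60, "unit-b": 40}


def bucket_of(user_id: str, buckets: int) -> int:
    """Map a user ID to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def route(user_id: str, weights: Mapping[str, int]) -> str:
    """Return the unit whose cumulative weight range contains the user's bucket."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one unit must have a positive weight")
    bucket = bucket_of(user_id, total)
    upper = 0
    for unit, weight in weights.items():
        upper += weight
        if bucket < upper:
            return unit
    raise AssertionError("unreachable: bucket is always below the total weight")
```

Because the bucket depends only on the user ID, the same user keeps landing in the same unit until the weights change. Setting a unit's weight to 0 drains it, which is the shape a traffic cut would take in this sketch.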

In terms of technology, remote backup uses one-way data replication, while remote active-active uses two-way replication. Two-way means that either data center may have a problem and each can switch to the other. The most important part is the technical implementation. At the data layer, circular replication must be avoided, in which data synchronized to the other data center is treated there as new data and copied back. With multiple data centers there is also the question of sequence numbers, which traditionally come from the database. In multi-active, sequence numbers must be generated by rules so that they are globally unique across the whole cluster rather than within a single data center. The sequence numbers generated in different data centers must never repeat, which requires the product to provide rules that solve this problem.
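
One common way to meet both requirements is sketched below. It is an assumption for illustration, not the rule the product actually uses: `UnitSequence`, `should_replicate`, and the `origin_unit` field are hypothetical names. Each unit generates IDs in its own residue class so that values cannot collide, and every change records the unit where it was first written so that a replicated change is never forwarded again.

```python
import itertools


class UnitSequence:
    """Generate IDs that are unique across units by interleaving them."""

    def __init__(self, unit_no: int, max_units: int = 16):
        if not 0 <= unit_no < max_units:
            raise ValueError("unit_no must be in [0, max_units)")
        self.unit_no = unit_no
        self.max_units = max_units
        self._counter = itertools.count(1)

    def next_id(self) -> int:
        # Each unit steps by max_units and offsets by its own number,
        # so two units with different numbers never produce the same value.
        return next(self._counter) * self.max_units + self.unit_no


def should_replicate(change: dict, local_unit: int) -> bool:
    """Forward only changes that were first written in the local unit.

    A change that arrived through replication carries another unit's
    number in "origin_unit" and is not copied back, which breaks the loop.
    """
    return change["origin_unit"] == local_unit
```

In this sketch the counter lives in memory; a real deployment would need to persist it, and `max_units` fixes an upper bound on the number of units in advance.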

Multi-active disaster recovery solution

Architecture diagram of the multi-active product solution

1. Access layer: The first thing multi-active has to solve is the traffic access layer, which is very important. The access layer controls access rules at a fine granularity and, following the business sharding rules, must map traffic precisely to each data center below. Once traffic comes in, it has to determine which data center should serve that user. How is this achieved in practice?

The traditional way is domain name switching. For example, a front-end domain name fronts two data centers, and switching changes the address the domain resolves to, so business originally connected to data center A moves to data center B. The problem is that this affects business in progress. When a data center fails and business must move quickly to another one, switching the domain name disrupts the transactions already running underneath. In addition, this kind of low-level switch cannot be coordinated with the cloud-native PaaS layer: the upper layer is cut over but the lower layers cannot perceive it. They do not know that the traffic has moved to another data center, and intermediate calls may still be going to the original data center's unit, which has a fairly large impact on business continuity. This mode can still solve some problems in extreme cases. If a data center cannot handle any business at all and a spare data center exists, cutting the domain name is a valid option.

The other way is to use cloud-native microservices, which can mark traffic. Once a request is marked, the mark is passed down through the cloud-native microservice stack, and the request is handled within one unit or one data center as far as possible instead of jumping to another data center.
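
A rough sketch of this marking idea follows, assuming a plain header carries the mark. The header name `x-unit` and the functions `on_request`, `outgoing_headers`, and `pick_instance` are hypothetical and do not correspond to any particular framework or to the product's SDK. The mark is set once at the entry point, copied onto every downstream call, and used to prefer service instances in the same unit.

```python
import contextvars
from typing import Callable, Dict, List, Optional

UNIT_HEADER = "x-unit"  # hypothetical header name used to carry the mark
_current_unit = contextvars.ContextVar("current_unit", default=None)


def on_request(headers: Dict[str, str],
               resolve_unit: Callable[[Dict[str, str]], str]) -> str:
    """At the entry point: reuse an existing mark or assign one."""
    unit = headers.get(UNIT_HEADER) or resolve_unit(headers)
    _current_unit.set(unit)
    return unit


def outgoing_headers(extra: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Copy the current mark onto a downstream call so it travels with the request."""
    headers = dict(extra or {})
    unit = _current_unit.get()
    if unit is not None:
        headers[UNIT_HEADER] = unit
    return headers


def pick_instance(instances: List[dict]) -> dict:
    """Prefer instances in the marked unit; fall back to any instance."""
    unit = _current_unit.get()
    same_unit = [i for i in instances if i["unit"] == unit]
    candidates = same_unit or instances
    if not candidates:
        raise LookupError("no service instance available")
    return candidates[0]
```

The fallback to instances in other units is a design choice made for this sketch. A stricter policy would raise an error instead, so that a request never leaves its unit.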

2. Application layer: The middle layer follows the access routing specification and includes service routing components, which our product system can provide separately. Some customers, for example, do not want a full solution because their middle layer already uses open source components, but they still want multi-active capability. In that case the upper layer can use our multi-active control plane to define precisely how many logical units there are and to provide APIs for the middle layer to call. The globally unique sequence numbers, routing rules, and sharding rules are all provided by that upper layer. Marking and traffic identification may look simple, but consider the distributed messages used for decoupling in a multi-active architecture: if a switch happens in one data center before its messages have been consumed, how should they be synchronized to the other data center? Problems like this need to be solved with the help of cloud native.

3. Data layer: This involves the logic of replication and writes. The write-prohibition control in our solution applies logic at the database: once the front end switches, the corresponding code is generated automatically. For example, while the data of the target data center is being restored, a time-stamped code is generated automatically, and writes are released again only when the data has been restored. We protect the database by prohibiting writes and by judging the database's replication delay. If the underlying data synchronization is not strong enough, the switch itself and most services still work, but many write services may not, because the database is restricted by the write-prohibition rule. In addition, the rules for data synchronization and the replication requirements across multiple data centers are governed by the overall rules.
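
The write-prohibition idea can be sketched as a small fence object. This is a simplified illustration under assumptions, not the product's database-level mechanism: the class `WriteFence` and its method names are hypothetical, and replication progress is reduced to a single timestamp. Writes are rejected from the moment of the switch until the target has applied every change made before it.

```python
import time
from typing import Optional


class WriteFence:
    """Reject writes during a switch until replication has caught up."""

    def __init__(self) -> None:
        # Timestamp of the switch; None means writes are allowed.
        self._fenced_at: Optional[float] = None

    def begin_switch(self, now: Optional[float] = None) -> None:
        """Record when the switch happened and start prohibiting writes."""
        self._fenced_at = time.time() if now is None else now

    def try_release(self, replicated_up_to: float) -> bool:
        """Lift the fence once changes up to the switch time have been applied."""
        if self._fenced_at is not None and replicated_up_to >= self._fenced_at:
            self._fenced_at = None
        return self._fenced_at is None

    def check_write(self) -> None:
        """Call before every write; raises while the fence is in place."""
        if self._fenced_at is not None:
            raise PermissionError("writes are prohibited until data has been restored")
```

Reads are not affected by the fence in this sketch, which matches the observation that most services keep working after a switch while write services wait for replication.
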

On top of the full product system we propose a concept (shown above). MSHA, a four-letter acronym, is the product that provides cloud-native multi-active capability today. We hope it can play a part in each of these four numbers: 0, 1, 5, and 10 minutes.

1. 0-minute prevention. As mentioned above, traffic switching can be used with a blue-green release environment deployed across two data centers. That is one method. Even within a single data center, two units can be defined logically in the console, and a blue-green release can be carried out quickly inside that data center. Blue-green release within one data center depends on product support: this component makes it possible to delineate clearly which resources belong to one unit and which to the other, and to carry out the blue-green release of a unit quickly.

2. 5-minute locating. With earlier approaches such as cold standby disaster recovery, the decision to switch is often very hard to make, because whoever makes the switch has to bear the consequences. We hope that this platform lets people see the impact of a failure intuitively, along with the actions or operations that each stakeholder needs to take to restore the application. When a failure occurs, the system can locate the problem quickly, for example within 5 minutes, and then start the decision on whether to switch traffic;

3. 10-minute recovery. Finally, we hope that with this model the whole process of bringing the entire business back into operation can be completed within 10 minutes.

Best Practices for Multi-Active Disaster Recovery

Here are a few examples of Alibaba's work with external enterprises. This multi-active disaster recovery capability is not available only on the public cloud. Deploying an application on the cloud does not mean that the cloud provides all high availability automatically: when you use cloud resources, you find that the cloud has different regions and that each region contains different availability zones. Use of the public cloud has to be matched to the actual situation. If most of a customer's users are in the south, the customer may open a node in a southern data center; if that Alibaba data center then has a problem, the customer's business is affected. Even though the business is deployed on the cloud and the cloud products provide high availability, once a failure involves the data center itself, the business is still affected. For this reason the multi-active capability can be deployed not only on the cloud but also in a customer's own data center, like commercial software.

Case 1: active-active in the same city

A logistics customer uses multi-active within the same city. Traditional technology would not have been a big problem here, but multi-active brings benefits: for example, the accompanying SDK identifies and passes on the request mark automatically, so the business code needs few modifications. After the disaster recovery build-out, the RTO is much shorter than before.

Case 2: remote dual-read

The difficulty in this remote dual-read case is that the distance is in the thousands of kilometers. At that distance both reading and writing are hard, and data replication itself has delay. The customer uses this set of products to unify the control and traffic levels, so that it is clear which businesses are read-only, which services are routed to the read data center, and what the replication status is. The RTO, now at the minute level, is greatly improved over the original, and traffic can be switched dynamically and flexibly online.

Case 3: remote active-active

This enterprise customer uses remote active-active; it currently writes in two data centers and may expand in the future. A lot of product adaptation was developed during the project, because achieving this required a lot of middle-layer work on top of the product's original basic capabilities. The whole process started from developing the multi-active product, then adapting it to the application scenario, and then completing the transformation together with the business. The core point is business continuity, so this does not mean that every business will be multi-active across data centers in the future, only the key businesses. For example, every year on Double Eleven our core concern is that orders must not be affected. Through decoupling or other means, logistics can be given a lower business continuity priority than order transactions. The key is to make sure that the services and products on the core transaction path do not cause problems when switching in the multi-active dimension.

We recommend trying the multi-active control platform for yourself. After two or more units are defined in the console, when one data center fails, multi-active is meant to switch its applications quickly to another data center. The prerequisite for switching is that the sites are defined in the console: whether a site is a logical one inside a single data center or one of several physical data centers, it must be mapped into the multi-active control platform. In the console we assign rules, such as how a single service is accessed and along which dimension its traffic is split, or how it is marked by ID. When traffic is switched, the console shows dynamically which dimensions of traffic are moving to the other data center, which makes it fairly simple to reallocate traffic quickly when a fault occurs.

Today, when we help customers deploy this capability, we often run traffic switches and drills through the console to see whether a data center is affected. The whole system also works with other solutions, such as fault drills, and in coordination with these faults the application is switched to another data center.


Multi-active and disaster recovery capability has been practiced in Alibaba's internal business for many years, and it took a long time to evolve into a product. The goal is for today's set of products and solutions to help companies build their own multi-active capability within 30 days. Enterprises that already have many products deployed on the public cloud can in fact build it in less time. We hope this set of products and solutions helps enterprises achieve minute-level failover and build multi-active capability quickly.


This article is the original content of Alibaba Cloud and may not be reproduced without permission.