Transwarp TDH 8.0 Must-Read, Part 2: Full Support for 10 Data Models, and the Future Belongs to the Multi-Model Big Data Platform


Transwarp released version 8.0 of its big data platform TDH (Transwarp Data Hub) in March 2021. Many users are very interested in this product, so this series of articles introduces the new features and technological innovations of TDH 8.0 one by one, helping enterprise data-platform users gain a more comprehensive, in-depth understanding of cutting-edge big data technology and make better technology selections. You can also watch our videos on Transwarp's official video account, the Transwarp community service account, bilibili, Tencent Video, and other sites.

Previous articles in this series

TDH 8.0 Must-Read, Part 1: Why You Need a Multi-Model Data Management Platform with Decoupled Storage and Compute

Are you still using single-model databases in 2021?

Nowadays, more and more companies are pursuing digital transformation. In the early stages, a company typically picks a few key scenarios for its first attempts at data collection, storage, analysis, decision-making, and applications. A single, relatively fixed, mature scenario can usually be supported by purchasing a suitable off-the-shelf big data or database product.

As digital transformation deepens and the enterprise grows rapidly, business departments expand, requirements change unpredictably, business innovation opportunities arrive, and corporate management standards rise. When these situations occur, independent big data and database products become data islands: barriers to data exchange between different scenarios, projects, businesses, and departments.

In the process of data fusion and innovation, an enterprise may need a richer set of storage models: relational storage, text storage, graph storage, object storage, search engines, geospatial storage, key-value storage, wide-table storage, time-series storage, event storage, and so on. Using multiple single-model databases leads to a series of problems with data redundancy, data consistency management, cross-database analysis, and resource allocation. At the same time, the languages and interfaces of the various products are not unified, so learning costs and operation-and-maintenance costs are high, and the system's total cost of ownership keeps rising.

Why do companies need a multi-model big data platform?

In recent years, more and more companies have realized that a future big data platform must not only configure different data models for different project scenarios to guarantee high performance, but also make data operations and maintenance more convenient and unified. Running multiple data models concurrently on one unified platform has therefore become increasingly popular.

The early implementation path of several multi-model data platforms was simply to combine multiple single-model databases into one software system: a relational database to persist structured table data, document storage for unstructured object data, key-value storage for hash tables, and a graph database for highly linked reference data. Combining multiple single-model databases in one project merely unifies the interface; it cannot fundamentally solve the problem.

In contrast, a native multi-model big data platform has natural advantages in the following respects:

**1. Stronger data consistency.** When the business requires different data models, a multi-model big data platform natively supports one copy of logical data modeled in multiple ways and applied to multiple different scenarios. This avoids the data-consistency, import/export-latency, and data-redundancy problems of using multiple single-model products.

**2. More flexible resource elasticity.** A multi-model big data platform pools the storage and computing resources of the different models. It can add or remove data-model types at any time according to business needs, and flexibly deploy and reclaim computing and storage resources: truly allocated on demand and reclaimed after use, so storage and computing resources are used more flexibly and fully.

**3. Simpler operations and maintenance.** Multiple single-model database products often have different interfaces and different syntaxes, so developers face high learning costs and high skill barriers. With a unified multi-model big data platform, developers only need to learn one language and one interface to operate multiple data models, which significantly lowers the difficulty.

Transwarp's multi-model big data platform implementation path

The common multi-model database architectures are as follows; traditional architectures mainly adopt three implementation modes:

The first: develop an independent, complete storage and compute stack for each new data model. The disadvantage is that storage and compute are coupled: the more models supported, the greater the system's development effort and complexity, and the higher the consumption of storage and computing resources.

The second: use a single storage engine to support multiple storage models. The disadvantage is that different data models have different storage requirements, so a single storage engine cannot match a suitable storage strategy to each of them, which limits the performance of the multi-model database.

The third: provide a unified user interface on top of multiple independent databases and forward requests to the underlying databases. The disadvantage is that the underlying databases use inconsistent development languages, so real development is difficult and troubleshooting is costly.

All three implementations have problems to varying degrees. To solve them, we need a unified architecture that simultaneously supports multiple models, high availability, and high performance. Version 8.0 of Transwarp's big data platform TDH (Transwarp Data Hub) adopts an original layered architecture design: a unified SQL compiler layer, a unified distributed computing engine layer, a unified distributed data management layer, and a unified resource scheduling layer, supporting 10 data models on a storage-compute decoupled foundation.

SQL layer: unified SQL compiler

Quark is a distributed SQL compiler independently developed by Transwarp. It is compatible with multiple SQL dialects, including HiveQL, Oracle, DB2, and Teradata, as well as their operators and type systems. Every database product in TDH follows a consistent SQL specification, so users need not worry about interface or language switches when scenarios or databases change. Unified SQL queries make developers' learning costs extremely low, make code more portable, and make technology integration easier.
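Quark itself is proprietary, but the idea of a dialect-compatible front end can be illustrated with a toy sketch: rewrite a couple of Oracle-style constructs into ANSI SQL, then run the result on Python's built-in sqlite3. The `normalize_dialect` function and its rewrite rules are hypothetical stand-ins, not Quark's actual mechanism (real compilers parse to an AST rather than using regexes).

```python
import re
import sqlite3

def normalize_dialect(sql: str) -> str:
    """Toy dialect normalizer: rewrite a few Oracle idioms into ANSI SQL."""
    # Oracle NVL(a, b) -> ANSI COALESCE(a, b)
    sql = re.sub(r"\bNVL\s*\(", "COALESCE(", sql, flags=re.IGNORECASE)
    # Oracle SYSDATE -> ANSI CURRENT_TIMESTAMP
    sql = re.sub(r"\bSYSDATE\b", "CURRENT_TIMESTAMP", sql, flags=re.IGNORECASE)
    return sql

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name TEXT)")
conn.execute("INSERT INTO t VALUES (NULL), ('alice')")
# The Oracle-flavored query runs unchanged after normalization
rows = conn.execute(
    normalize_dialect("SELECT NVL(name, 'unknown') FROM t ORDER BY 1")
).fetchall()
print(rows)  # [('alice',), ('unknown',)]
```

The point of a unified compiler layer is exactly this: the application keeps its familiar dialect while the platform translates it to one internal representation.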

Computing layer: Transwarp Nucleon, a unified distributed computing engine

Nucleon is a distributed computing engine independently developed by Transwarp. It automatically matches high-performance algorithms to the different storage engines without manual intervention, conveniently enabling cross-database joins and avoiding data import and export.

Data management layer: a unified data storage system provides common storage management services for different storage engines

TDDMS is a distributed data management system independently developed by Transwarp. It maintains strong consistency across multiple replicas of the data; manages sensible shard distribution across storage media and automatically redistributes data when storage capacity expands, making full use of storage resources; and ensures high data availability, keeping storage services uninterrupted when storage hardware fails. TDFS (Transwarp Distributed File System) is a distributed file system independently developed by Transwarp. It provides a file directory structure and related services, and is mainly used for file-based data exchange during batch data import and export.
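TDDMS's internals are not public, but the "redistribute only what is necessary when capacity expands" behavior it describes is commonly achieved with consistent hashing. The sketch below (all names hypothetical) shows that adding a fourth node moves only a fraction of the keys rather than reshuffling everything:

```python
import bisect
import hashlib

class ShardRing:
    """Toy consistent-hash ring: adding a node relocates only a
    fraction of the keys, illustrating incremental data redistribution."""
    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes
        self.ring = []               # sorted list of (hash, node)
        for n in nodes:
            self.add_node(n)

    def _h(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        # Each node owns many virtual points for even load spread
        for i in range(self.vnodes):
            self.ring.append((self._h(f"{node}#{i}"), node))
        self.ring.sort()

    def locate(self, key):
        hashes = [h for h, _ in self.ring]
        i = bisect.bisect(hashes, self._h(key)) % len(self.ring)
        return self.ring[i][1]

keys = [f"row-{i}" for i in range(1000)]
ring = ShardRing(["node-a", "node-b", "node-c"])
before = {k: ring.locate(k) for k in keys}
ring.add_node("node-d")              # expand storage capacity
after = {k: ring.locate(k) for k in keys}
moved = sum(before[k] != after[k] for k in keys)
print(f"{moved / len(keys):.0%} of keys moved")  # roughly 1/4, not 100%
```

A naive `hash(key) % node_count` scheme would move almost every key on expansion; the ring keeps the reshuffle proportional to the added capacity.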

Resource management layer: unified resource scheduling system TCOS

TCOS is a cloud-native operating system independently developed by Transwarp that sits on top of the server hardware and operating system. It provides a unified resource scheduling framework and, through containerized orchestration, schedules basic resources such as computing, storage, and networking in a unified way. It supports one-click deployment of TDH and online scale-out and scale-in, as well as priority-based preemptive resource scheduling and fine-grained resource allocation. TCOS is built on advanced cloud-native technology, adapted to a variety of mainstream CPU architectures and operating systems, and supports mixed deployment of servers with different hardware and operating systems. When a cluster is expanded, customers need not worry about compatibility between new and old equipment, and resource utilization is higher.

Heterogeneous storage engine layer: Support 10 storage models with 8 heterogeneous storage engines

On Transwarp's multi-model data management platform, data from different sources is still stored in different storage engines to guarantee high performance. The different databases are organized within one unified multi-model platform, and cross-database correlation analysis requires no extra export and import steps, which avoids data redundancy and is very convenient. TDH 8.0 provides 8 independent storage engines to guarantee high performance for the different storage models; users can add or remove storage engines at any time according to business needs, so resources are allocated on demand.

1. Relational analysis engine Inceptor: relational data storage

Transwarp Inceptor is a relational analysis engine independently developed by Transwarp that provides high-performance analysis of petabytes of data. Inceptor was the world's first product to pass the TPC-DS benchmark for analytical decision-support systems. It supports complete standard SQL syntax, is compatible with the Oracle, IBM DB2, and Teradata dialects as well as Oracle and DB2 stored procedures for smooth application migration, and supports distributed transaction processing to ensure strong data consistency. Inceptor helps users quickly build applications such as data lakes and data warehouses.
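Inceptor is not publicly available, so as a stand-in the sketch below uses Python's built-in sqlite3 purely to illustrate the warehouse-style aggregation queries a relational analysis engine serves; the table and data are invented for the example.

```python
import sqlite3

# Illustrative only: a tiny analytical query of the kind a relational
# engine like Inceptor runs over billions of rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 100.0), ("north", 250.0), ("south", 80.0)])
rows = conn.execute("""
    SELECT region, SUM(amount) AS total, COUNT(*) AS orders
    FROM sales
    GROUP BY region
    ORDER BY total DESC
""").fetchall()
print(rows)  # [('north', 350.0, 2), ('south', 80.0, 1)]
```

The value of a shared SQL layer is that this exact query shape stays valid whether it runs on a laptop database or a distributed engine.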

2. Wide-table database Hyperbase: wide-table, object, and text storage

Transwarp Hyperbase is a NoSQL wide-table database independently developed by Transwarp that supports business requirements of millions of concurrent requests with millisecond latency. Hyperbase stores structured data as well as unstructured data such as text, images, videos, and objects; supports indexing technologies such as full-text and secondary indexes; provides multi-tenant management; and supports standard SQL syntax while remaining compatible with open-source HBase. Hyperbase helps users quickly build applications such as historical data query and online business retrieval.
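The wide-table (Bigtable/HBase-style) data model behind Hyperbase can be sketched in a few lines: each row key maps to a sparse set of `family:qualifier` cells, so different rows may populate entirely different columns. This `WideTable` class is a hypothetical illustration, not Hyperbase's API.

```python
from collections import defaultdict

class WideTable:
    """Toy wide-table store: row key -> sparse {family:qualifier -> value}.
    Sparse columns are what make the model suit heterogeneous data."""
    def __init__(self):
        self.rows = defaultdict(dict)

    def put(self, row_key, family, qualifier, value):
        self.rows[row_key][f"{family}:{qualifier}"] = value

    def get(self, row_key, column=None):
        row = self.rows.get(row_key, {})
        return row.get(column) if column else dict(row)

t = WideTable()
t.put("user#42", "profile", "name", "alice")
t.put("user#42", "activity", "last_login", "2021-03-01")
print(t.get("user#42", "profile:name"))  # alice
print(t.get("user#42"))
```

In a real wide-table store the row key also determines physical sort order and sharding, which is why key design (like the `user#42` prefix) matters.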

3. Distributed graph database StellarDB: graph storage

Transwarp StellarDB is an enterprise-grade distributed graph database independently developed by Transwarp that provides high-performance graph storage, computation, analysis, query, and visualization services. StellarDB supports native graph storage at the scale of tens of billions of vertices, trillions of edges, and petabytes of graph data; offers deep link analysis of 10+ hops with a rich set of graph and deep-graph algorithms; supports a standard graph query language compatible with OpenCypher; and can render massive data in 3D. StellarDB helps users quickly build applications such as fraud detection, recommendation engines, social network analysis, and knowledge graphs.
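The "deep link analysis" a graph database performs is, at its core, bounded multi-hop traversal. A minimal breadth-first sketch (not StellarDB code; the edge list is invented) of finding everything within k hops of a starting vertex:

```python
from collections import deque

def k_hop_neighbors(edges, start, k):
    """All vertices reachable from `start` within k hops, via BFS.
    This is the primitive behind multi-level link analysis."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)   # treat edges as undirected
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue                      # do not expand past the hop limit
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen - {start}

edges = [("a", "b"), ("b", "c"), ("c", "d"), ("x", "y")]
print(sorted(k_hop_neighbors(edges, "a", 2)))  # ['b', 'c']
```

A native graph store keeps adjacency physically co-located so each hop is a pointer-like lookup rather than a join, which is why 10+ hop queries remain feasible at scale.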

4. Search engine Transwarp Scope: full-text search

Transwarp Scope is a distributed search engine independently developed by Transwarp. It provides interactive multi-dimensional retrieval and analysis over petabytes of data, delivering highly reliable, highly scalable full-text search and flexible queries. It responds to retrieval requests in milliseconds and recovers from single points of failure in minutes. Scope stores structured, semi-structured, and unstructured data such as images, audio, video, and Internet data, and guarantees strong data consistency. Scope helps users quickly build applications such as text analysis and retrieval and enterprise search engines.
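The core data structure behind any full-text search engine, Scope included, is the inverted index: a map from term to the set of documents containing it. A toy sketch (whitespace tokenization and AND semantics are simplifications; real engines add analyzers, scoring, and sharded postings):

```python
from collections import defaultdict

class InvertedIndex:
    """Toy inverted index: term -> set of document ids."""
    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, query):
        # AND semantics: documents containing every query term
        sets = [self.postings.get(t.lower(), set()) for t in query.split()]
        return set.intersection(*sets) if sets else set()

idx = InvertedIndex()
idx.add(1, "distributed search engine")
idx.add(2, "distributed graph database")
print(idx.search("distributed engine"))  # {1}
```

Intersecting precomputed postings lists is what turns a scan over every document into a millisecond lookup.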

5. Spatio-temporal database Spacture: geospatial storage

Transwarp Spacture is a distributed spatio-temporal database independently developed by Transwarp that provides storage, query, analysis, and mining services for massive data such as spatial geography, spatio-temporal trajectories, and remote-sensing imagery. Spacture offers high-performance reads, writes, and analysis; supports OGC-standard geometry types and spatial relationships, and is compatible with common open-source and commercial GIS software; and has built-in high-efficiency algorithms for spatio-temporal indexing, spatial topology and geometry, and remote-sensing image processing. Spacture helps users quickly build applications such as spatio-temporal query analysis, pattern mining, and trajectory clustering, widely used in location services, city management, transportation and logistics, and epidemic prevention and control.
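A representative geospatial primitive is the "points within radius" query. This sketch computes great-circle distance with the haversine formula and filters a point list by it; a real spatio-temporal engine would answer the same question through a spatial index rather than a linear scan, and the coordinates here are illustrative.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    r = 6371.0                          # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    a = (math.sin(dlat / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def within_radius(points, center, radius_km):
    """Linear-scan version of a 'points near me' spatial query."""
    clat, clon = center
    return [p for p in points if haversine_km(p[0], p[1], clat, clon) <= radius_km]

pois = [(31.23, 121.47), (39.90, 116.40)]         # near Shanghai, Beijing
print(within_radius(pois, (31.0, 121.0), 100.0))  # [(31.23, 121.47)]
```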

6. Key-value database Transwarp KeyByte: key-value storage

Transwarp KeyByte is a high-performance key-value database that provides real-time data insertion and high-concurrency retrieval. KeyByte adopts a master-slave high-availability architecture with disaster tolerance, automatic master/standby switchover, and failure migration; it is compatible with the core Redis data structures and API, supports data persistence, and supports elastic scaling. KeyByte helps users quickly build hotspot data caching, high-concurrency data storage, and real-time or time-limited business support applications.
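The "time-limited business support" use case rests on per-key expiry, the pattern Redis exposes as `SET key value EX seconds`. KeyByte's own API is not shown here; this `TTLCache` is a hypothetical in-process sketch of the semantics, using lazy eviction on read.

```python
import time

class TTLCache:
    """Toy key-value cache with per-key expiry (lazy eviction on read)."""
    def __init__(self):
        self._data = {}   # key -> (value, expires_at or None)

    def set(self, key, value, ttl=None):
        expires = time.monotonic() + ttl if ttl is not None else None
        self._data[key] = (value, expires)

    def get(self, key, default=None):
        item = self._data.get(key)
        if item is None:
            return default
        value, expires = item
        if expires is not None and time.monotonic() >= expires:
            del self._data[key]          # expired: evict and miss
            return default
        return value

cache = TTLCache()
cache.set("session:42", "alice", ttl=0.05)   # expires after 50 ms
print(cache.get("session:42"))               # alice
time.sleep(0.06)
print(cache.get("session:42"))               # None
```

Production stores combine lazy eviction with background sweeps so expired keys do not accumulate unread.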

7. Time-series database Transwarp TimeLyre: time-series data storage

Transwarp TimeLyre is a time-series database that provides efficient compressed storage and high-performance analysis for massive time-series data. TimeLyre supports high-speed reads and writes, processing hundreds of thousands of records and hundreds of queries per second. TimeLyre helps users quickly build applications such as real-time monitoring, early warning, and fault diagnosis for various businesses and equipment.
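The compressed storage mentioned above typically starts from delta encoding: consecutive timestamps and readings differ by small amounts, so storing the first value plus successive differences yields much smaller numbers to encode. TimeLyre's actual codec is not documented here; this is the generic technique in its simplest form.

```python
def delta_encode(values):
    """First value plus successive differences; small deltas
    compress far better than raw monotonically growing values."""
    if not values:
        return []
    out = [values[0]]
    for prev, cur in zip(values, values[1:]):
        out.append(cur - prev)
    return out

def delta_decode(deltas):
    """Reverse of delta_encode: running sum restores the series."""
    out, acc = [], 0
    for d in deltas:
        acc += d
        out.append(acc)
    return out

ticks = [1000, 1001, 1003, 1006, 1010]
encoded = delta_encode(ticks)
print(encoded)  # [1000, 1, 2, 3, 4]
assert delta_decode(encoded) == ticks
```

Real engines layer further tricks on top (delta-of-delta for timestamps, XOR encoding for floats, bit packing), but the round-trip property shown here is the foundation.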

8. Event store Transwarp Event Store: event storage

Transwarp Event Store is a high-throughput distributed NoSQL database that provides storage and processing services for messages and events. Event Store supports data persistence and data replay from a specified point in time to guarantee event ordering, and offers elastic scaling and fault tolerance. Event Store helps users quickly build applications such as log collection, application monitoring, stream data processing, and online analysis.
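The replay-from-a-point-in-time guarantee comes from the append-only log structure that event stores (and systems like Kafka) share: events get monotonically increasing offsets, and a consumer can re-read from any offset in the original order. A minimal in-memory sketch, not Event Store's API:

```python
class EventLog:
    """Toy append-only event log with ordered replay from an offset."""
    def __init__(self):
        self._events = []            # offset == list index, strictly ordered

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1   # offset assigned to the new event

    def replay(self, from_offset=0):
        # Consumers re-read everything from a known offset, in order
        yield from self._events[from_offset:]

log = EventLog()
for e in ["created", "paid", "shipped"]:
    log.append(e)
print(list(log.replay(1)))  # ['paid', 'shipped']
```

Because the log is immutable and ordered, a crashed consumer recovers simply by remembering its last processed offset and replaying from there.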

Beyond these 8 storage engines, TDH 8.0 also provides our classic products: the real-time stream computing engine Slipstream and the data science platform Sophon Discover, meeting users' diverse usage scenarios.

Real-time stream computing engine Slipstream: real-time monitoring, real-time ETL

Transwarp Slipstream is an enterprise-grade, high-performance real-time stream computing engine independently developed by Transwarp that supports business requirements of millions of events per second with millisecond latency. Slipstream supports both event-driven and micro-batch processing modes, exactly-once semantics, complex event processing (CEP), a rule engine, and SQL-based programming and development. Slipstream helps users quickly build real-time data warehouses, real-time report analysis, real-time intelligent recommendation, and real-time fraud detection and risk control.
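The difference between the two modes mentioned above is when work is triggered: event-driven engines process each record on arrival, while micro-batch engines group records and process each group. The grouping half can be sketched as a generator (illustrative only; real engines batch by time window as well as by count, and Slipstream's internals are not shown here):

```python
def micro_batches(stream, batch_size):
    """Group an unbounded record stream into fixed-size batches --
    the scheduling unit of micro-batch stream processing."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch              # hand a full batch to the processor
            batch = []
    if batch:                        # flush the final partial batch
        yield batch

events = iter(range(7))
print(list(micro_batches(events, 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```

Micro-batching trades a small amount of latency for higher throughput and simpler fault-recovery checkpoints, which is why engines often offer both modes.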



Data science platform Sophon Discover: data mining, machine learning

Transwarp Sophon Discover is a data mining, analysis, and exploration toolkit independently developed by Transwarp. It contains a rich distributed algorithm library and built-in industry application modules such as financial anti-fraud and public-opinion text mining. Sophon Discover enables data analysis and processing in multiple programming languages such as R, Python, and Spark, and supports unified operation and management of deep learning frameworks such as TensorFlow and Torch along with heterogeneous hardware resources.

TDH8.0 practice plan

In TDH 8.0, Slipstream handles real-time stream processing; Inceptor handles batch processing of structured data, data lakes, and data warehouses; and Hyperbase handles unstructured data with wide-table, text, and object storage. Together they form an integrated solution for real-time stream processing, batch processing, data lakes, and data warehouses.

The platform also provides other services: the search engine Scope for full-text search, the graph database StellarDB for multi-level link analysis between entities, the spatio-temporal database Spacture for spatio-temporal geographic analysis, and so on.

Compared with traditional open-source solutions, the multi-model big data platform has lower architectural complexity, lower development cost, lower operation-and-maintenance cost, and higher data processing efficiency.


Transwarp's multi-model big data management platform TDH 8.0 adopts an innovative architecture of four unified layers (interface, computing, management, and scheduling) plus heterogeneous storage engines supporting ten storage models, guaranteeing high performance, reliability, and availability across the different data models while achieving more flexible resource configuration and simpler, easier-to-use operations and maintenance.

Looking ahead, we believe that everyone from large enterprises and institutions to small and micro businesses to individual developers will be able to easily build, develop, operate, and maintain their own data platforms and applications through convenient access and a friendly development environment. The vision of big data coming from everyone and serving everyone is turning from science fiction into technological reality.