Business scene:
Rapid optimization of work order tables with tens of millions of data
Database partitioning technology:
Database partitioning does not generate a new table, but evenly distributes the table data to different disks, systems or different server storage media. In fact, it is still a table. The business code does not need to be changed.
Advantages of database partitioning: Partitions can store more data than a single file system or hard disk. When cleaning data, you can directly delete the partition where the discarded data is located. It can greatly improve query efficiency, similar to sub-library and table sub-division. For example, if there are 20 million data, divided into 10 partitions, each partition will only have 2 million data, so when I query, I will only query these 2 million data. Disadvantages of database partitioning: Database partitioning has a limitation. The partition field must be a unique index, which means that all my query conditions must be accompanied by this unique index to tell the database which partition to query. Moreover, this solution relies more on the DBA team and puts pressure on the database, so this solution is basically not used. But there is also an advantage that there is no need to introduce other frameworks, and it can be achieved solely by the database.
Introduction to hot and cold separation:
Hot data refers to data that needs to be distributed to users immediately. That is, after being captured from the data source and processed, it needs to be immediately stored in a storage medium that can be quickly distributed for use by APIs or systems directly facing users. Hot data needs to focus on ensuring service quality and stability. In order to ensure the timeliness of data, it is also a high priority data in data processing. Cold data refers to data that does not need to be sent to users immediately.
These numbers will not be distributed to users as is; they need to be accumulated over a long period of time so that we can obtain higher-level analysis based on them.
Hot and cold separation
Cold-hot separation refers to dividing the database into cold storage and hot storage when processing data. The cold storage refers to the database used to store data that has reached a final state (cold data), and the hot storage is used to store data that still needs to be modified (hot data).
Cold and hot separation means that when processing data, the data is divided into cold storage and hot storage. The cold storage stores the data that has reached the final state and is not frequently used, while the hot storage stores the data that needs to be modified and is frequently used.
When can we use hot and cold separation:
1. After the data reaches the final state, there is only a demand for reading, but no demand for writing. For example, when an order reaches the final state, there will be no more state changes.
2. Users can accept the separate query of new and old data. For example, some e-commerce companies only allow querying orders within 3 months by default. If you want to query orders beyond 3 months, you have to visit other pages. In other words, there will be no need to query cold and hot data at the same time.
Ideas for implementing hot and cold separation:
1. Issues to consider when using MySQL for both cold and hot data: 1. How to determine whether a data is cold data or hot data? Generally, we can determine it based on the fields of the main table, such as creation time and status. In our use process, how to distinguish between cold data and hot data depends on specific business needs. 2. How to trigger the separation of cold and hot data?
There are three ways to trigger hot and cold separation
1. Implemented by business code. When the data in the table changes, a check is performed every time the data changes to see if it can be put into the cold storage. If it can, it is put into the cold storage and the hot storage data is deleted (the most real-time and the most changes)
2. By monitoring MySQL’s binlog log (low latency)
3. Through scheduled tasks and scheduled scanning, archive the data that can be placed in the cold storage (high latency, cannot achieve real-time performance)
How to achieve hot and cold separation
To achieve hot and cold separation is to determine whether the data can enter the cold storage. If it can, insert the cold data and delete the hot data. However, we need to pay attention to the following points:
1. Transaction consistency problem: We need to develop a backup plan. The most common solution is to add a field to mark whether the data can be entered into the cold data, and then schedule a task to scan the data every day to see if it is consistent.
2. For problems with large amounts of data, batch processing can be used, that is, multiple data can be processed at one time
3. Concurrent processing When the data is very large, batch processing cannot meet the requirements, and multi-threading must be used. Here, locks must be done well, which can be achieved using database locks.
4. How to use hot and cold separation to distinguish according to the front-end page
5. How to migrate historical data? You can write a Java program to migrate it. If you use a scheduled task solution, you don’t need to pay attention to it.
2. Cold data is stored in hbase
Hbase: HBase is a distributed, scalable NoSQL database that supports massive data storage. It is a sub-project of Apache’s Hadoop project and is a database suitable for unstructured data storage. In the first issue above, both hot and cold data use mysql, which has two problems: Users will still have scenarios where they query cold data. And it is still very slow when users query cold data. Then you need a database that can store hundreds of millions of data, support simple combined keyword queries, and the stored data will not change. So we use hbase to implement it. Hbase is a non-relational database that naturally supports distributed. mysql, pgsql, oracle: relational databases, fields in different tables are related
Advantages: can implement complex business logic Disadvantages: performance is worse than non-relational databases Non-relational databases: redis, hbase, es; query efficiency is higher
Disadvantages: cannot implement complex business logic
What is a non-relational database and what is a relational database:
Relational database: A database based on the relational model, which uses mathematical concepts and methods such as set algebra to process data in the database.
Simply put, relational databases use mathematical methods such as selection, projection, connection, union, intersection, difference, division, addition, deletion, query and modification to store and query data. SQL statements can be used to conveniently perform very complex data queries in one table and between multiple tables. It is highly secure.
Non-relational database: referred to as NOSQL, is based on the correspondence between key-value pairs and does not require SQL layer parsing, so the performance is very high. However, it is not suitable for multi-table joint queries and some more complex queries. NoSQL is used for the storage of ultra-large-scale data.
Comparison between NoSQL and relational databases?
advantage:
1) Cost: NoSQL databases are simple and easy to deploy, and are basically open source software. They do not require a lot of cost to purchase and use like Oracle, and are cheaper than relational databases.
2) Query speed: NoSQL databases store data in cache, while relational databases store data on hard disks, so naturally the query speed is far slower than that of NoSQL databases.
3) Data storage format: Nosql’s storage format is key, value, document, image, etc., so it can store basic types as well as objects or collections, while the database only supports basic types (varchar, etc.).
4) Scalability: Relational databases have limitations such as multi-table query mechanisms such as joins, which makes expansion difficult.
shortcoming:
1) The maintenance tools and materials are limited, because NoSQL is a new technology and cannot be compared with the relational database technology that has been around for more than ten years.
2) No support for SQL. If industrial standards such as SQL are not supported, it will incur certain learning and usage costs for users.
3) Does not provide relational database transaction processing.
Disadvantages of hot and cold separation:
The business requirements are very high, requiring that hot and cold data cannot be queried at the same time, cold data cannot be modified, and cold data cannot be queried in complex ways. However, it is very cost-effective and can quickly achieve project optimization.