Redis Source Code Reading Series: sds

· 2023-11-16 ·

Redis source code reading, part 2: sds

1 Why doesn't Redis use native C strings?

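Native C strings have a number of limitations, so Redis designs and implements its own string type.
It is a well-worn question: why does Redis need a dedicated string type, the sds struct?
I asked ChatGPT, and it gave the following answer: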

Redis does not use the C language's built-in strings primarily because Redis is designed to be a high-performance, in-memory data store that offers advanced data manipulation and storage capabilities. The built-in C strings in the C programming language have limitations that make them unsuitable for Redis' requirements. Here are a few reasons why Redis does not use C's built-in strings:

Dynamic resizing: Redis needs to efficiently handle strings of varying lengths, including very large strings. C's built-in strings have a fixed size and cannot dynamically resize themselves, which would limit Redis' ability to handle dynamic data.
Memory management: Redis manages memory dynamically and optimizes memory usage. C's built-in strings do not provide automatic memory management or support for efficient memory allocation strategies, which Redis requires to optimize memory usage and prevent memory leaks.
Performance considerations: Redis is designed to offer high-performance data storage and manipulation. C's built-in strings have limitations in terms of performance optimizations, such as efficient substring extraction, concatenation, and comparison operations, which could impact Redis' overall performance.
Additional data structure requirements: Redis supports various data types, including lists, sets, hashes, and more. C's built-in strings do not provide the necessary abstractions and functionality required to handle these complex data structures efficiently.
Compatibility with other languages: Redis is widely used as a data store in various programming languages. By implementing its own string data structure, Redis ensures consistency and compatibility across different language bindings and client libraries.

By designing its own string data structure, Redis can tailor it specifically to meet the performance, memory management, and functionality requirements of a high-performance data store. This allows Redis to offer efficient storage and manipulation of data, making it popular for caching, messaging, and other data-intensive applications.

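In my view, the three most important reasons are:

Binary safety: a binary-safe string can hold arbitrary bytes. In a C string, \0 marks the end of the string, so a native C string cannot store content that itself contains \0.
High performance: reading the length, appending, comparing, and growing or shrinking a string are all more efficient than with native C strings. For example, sds reads the length from the len field in O(1), whereas strlen has to scan for \0 in O(n).
Avoiding buffer overflows: in C, concatenating strings without first making sure the destination buffer is large enough overflows the buffer. In Redis, concatenation automatically checks whether there is enough space and grows the buffer if not, which avoids the overflow.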
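
The three points above can be illustrated with a small length-prefixed byte string. This is only a sketch, not Redis's sds: the type bstr, the functions bstr_init, bstr_append, bstr_len, and bstr_free, and the doubling growth policy are all hypothetical names and choices made for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical illustration type, not Redis's sds: a byte buffer that
 * records its own length and capacity instead of relying on '\0'. */
typedef struct {
    size_t len;   /* bytes in use */
    size_t cap;   /* bytes allocated for data, excluding the trailing '\0' */
    char *data;   /* kept '\0'-terminated so it can still be passed to C APIs */
} bstr;

/* Create an empty string. Returns 0 on success, -1 on allocation failure. */
int bstr_init(bstr *s) {
    assert(s != NULL);
    s->len = 0;
    s->cap = 0;
    s->data = malloc(1);
    if (s->data == NULL) return -1;
    s->data[0] = '\0';
    return 0;
}

/* Append n raw bytes, growing the buffer first if it is too small.
 * The capacity check is what prevents the overflow that strcat() allows. */
int bstr_append(bstr *s, const void *src, size_t n) {
    assert(s != NULL && s->data != NULL);
    if (n == 0) return 0;
    assert(src != NULL);
    if (n > s->cap - s->len) {
        /* Refuse sizes whose doubled capacity would overflow size_t. */
        if (s->len > (SIZE_MAX - 1) / 2 || n > (SIZE_MAX - 1) / 2 - s->len)
            return -1;
        size_t newcap = (s->len + n) * 2;   /* assumed policy: double */
        char *p = realloc(s->data, newcap + 1);
        if (p == NULL) return -1;
        s->data = p;
        s->cap = newcap;
    }
    memcpy(s->data + s->len, src, n);   /* copies '\0' bytes like any other */
    s->len += n;
    s->data[s->len] = '\0';
    return 0;
}

/* O(1): the length is stored, not recomputed by scanning. */
size_t bstr_len(const bstr *s) {
    assert(s != NULL);
    return s->len;
}

void bstr_free(bstr *s) {
    assert(s != NULL);
    free(s->data);
    s->data = NULL;
    s->len = 0;
    s->cap = 0;
}
```

After appending the five bytes a, b, \0, c, d, bstr_len reports 5 while strlen on the same buffer stops at the embedded \0 and reports 2. That difference is the binary-safety gap between the two representations.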

2 The sds structs

/* Note: sdshdr5 is never used, we just access the flags byte directly.
 * However is here to document the layout of type 5 SDS strings. */
struct __attribute__ ((__packed__)) sdshdr5 {
    unsigned char flags; /* 3 lsb of type, and 5 msb of string length */
    char buf[];
};
struct __attribute__ ((__packed__)) sdshdr8 {
    uint8_t len; /* used */
    uint8_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};
struct __attribute__ ((__packed__)) sdshdr16 {
    uint16_t len; /* used */
    uint16_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};
struct __attribute__ ((__packed__)) sdshdr32 {
    uint32_t len; /* used */
    uint32_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};
struct __attribute__ ((__packed__)) sdshdr64 {
    uint64_t len; /* used */
    uint64_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};

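Redis defines five sds header structs for strings of different lengths. The main goal is to save memory. Compare the header sizes: sdshdr8 needs 3 bytes (len, alloc, and flags), while sdshdr64 needs 17, because each of its two length fields is 8 times as wide. Storing a string that fits in sdshdr8 under sdshdr64 would therefore waste 14 bytes. For a reasonably large internet company holding more than 10 billion strings, that adds up to roughly 140 GB of extra memory for string headers alone. Redis clearly puts a lot of care into optimizing its data structures.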
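
The header arithmetic can be checked with a short sketch. The structs below copy the sdshdr8 and sdshdr64 layouts under hypothetical demo_ names, and extra_header_bytes is a helper invented for this example. It assumes a compiler that supports the packed attribute, such as GCC or Clang.

```c
#include <assert.h>
#include <stdint.h>

/* Same field layout as sdshdr8, renamed (hypothetical name) so the
 * sketch does not clash with the real sds.h. */
struct __attribute__((__packed__)) demo_hdr8 {
    uint8_t len;
    uint8_t alloc;
    unsigned char flags;
    char buf[];          /* flexible array member: adds nothing to sizeof */
};

/* Same field layout as sdshdr64 (hypothetical name). */
struct __attribute__((__packed__)) demo_hdr64 {
    uint64_t len;
    uint64_t alloc;
    unsigned char flags;
    char buf[];
};

/* Extra header bytes spent if `count` short strings used the 64-bit
 * header instead of the 8-bit one. */
uint64_t extra_header_bytes(uint64_t count) {
    uint64_t per_string = sizeof(struct demo_hdr64) - sizeof(struct demo_hdr8);
    assert(count <= UINT64_MAX / per_string);   /* guard against overflow */
    return count * per_string;
}
```

With packing, the two header sizes are 3 and 17 bytes, so extra_header_bytes(10000000000ULL) returns 140000000000, about 140 GB. Without the packed attribute the compiler would pad the 64-bit header to 24 bytes on typical platforms, which is one reason the real definitions are packed.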

Field annotations:

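struct __attribute__ ((__packed__)) sdshdr8 {
    // number of bytes of buf currently in use
    uint8_t len;
    // number of bytes allocated for buf, excluding the header and the null terminator
    uint8_t alloc;
    // the low three bits encode the sds type
    unsigned char flags;
    // holds the actual string data
    char buf[];
};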
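
Because flags is the last field before buf, a pointer to buf is enough to find the type and then the rest of the header. The sketch below shows that idea with a hypothetical toy_hdr8 struct and hypothetical helpers toy_new8, toy_type, toy_len8, and toy_free8. The constants are assumed to mirror SDS_TYPE_8 and SDS_TYPE_MASK in sds.h, and the code is an illustration rather than the Redis implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TOY_TYPE_8    1   /* assumed to mirror SDS_TYPE_8 */
#define TOY_TYPE_MASK 7   /* low three bits of flags hold the type */

/* Same field layout as sdshdr8 (hypothetical name). */
struct __attribute__((__packed__)) toy_hdr8 {
    uint8_t len;
    uint8_t alloc;
    unsigned char flags;
    char buf[];
};

/* Allocate header, payload, and '\0' in one block; return a pointer to buf. */
char *toy_new8(const void *init, size_t len) {
    assert(len <= UINT8_MAX);
    struct toy_hdr8 *h = malloc(sizeof *h + len + 1);
    if (h == NULL) return NULL;
    h->len = (uint8_t)len;
    h->alloc = (uint8_t)len;
    h->flags = TOY_TYPE_8;
    if (len > 0) memcpy(h->buf, init, len);
    h->buf[len] = '\0';
    return h->buf;
}

/* The flags byte sits immediately before buf, so s[-1] reaches it. */
int toy_type(const char *s) {
    return ((const unsigned char *)s)[-1] & TOY_TYPE_MASK;
}

/* Step back over the whole header to read len in O(1). */
size_t toy_len8(const char *s) {
    const struct toy_hdr8 *h =
        (const struct toy_hdr8 *)(s - sizeof(struct toy_hdr8));
    return h->len;
}

/* Free the original allocation, which starts at the header, not at buf. */
void toy_free8(char *s) {
    if (s != NULL) free(s - sizeof(struct toy_hdr8));
}
```

The pointer returned by toy_new8 can be passed to functions that expect a plain C string, while toy_len8 still reports the stored length even when the payload contains \0 bytes.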

On the struct