Overview
LMDB(Lightning Memory-Mapped Database) is a tiny database with some great capabilities:
LMDB基本架构lmdb的基本架构如下:
lmdb的基本做法是使用mmap文件映射,不管这个文件存储实在内存上还是在持久存储上。lmdb的所有读取操作都是通过mmap将要访问的文件只读的映射到虚拟内存中,直接访问相应的地址.因为使用了read-only的mmap,同样避免了程序错误将存储结构写坏的风险。并且IO的调度由操作系统的页调度机制完成。而写操作,则是通过write系统调用进行的,这主要是为了利用操作系统的文件系统一致性,避免在被访问的地址上进行同步。 lmdb把整个虚拟存储组织成B+Tree存储,索引和值读存储在B+Tree的页面上.对外提供了关于B+Tree的操作方式,利用cursor游标进行。可以进行增删改查。 使用Memory MapMemory Map原理内存映射就是把物理内存映射到进程的地址空间之内,这些应用程序就可以直接使用输入输出的地址空间.由此可以看出,使用内存映射文件处理存储于磁盘上的文件时,将不需要由应用程序对文件执行I/O操作,这意味着在对文件进行处理时将不必再为文件申请并分配缓存,所有的文件缓存操作均由系统直接管理,由于取消了将文件数据加载到内存、数据从内存到文件的回写以及释放内存块等步骤,使得内存映射文件在处理大数据量的文件时能起到相当重要的作用。 Linux下mmap的实现过程与普通文件io操作mmap映射原理与过程:
一般文件io操作方式:
通过内存映射的方法访问硬盘上的文件,效率要比read和write系统调用高, read()是系统调用,其中进行了数据拷贝,它首先将文件内容从硬盘拷贝到内核空间的一个缓冲区,然后再将这些数据拷贝到用户空间,在这个过程中,实际上完成了 两次数据拷贝 ;而mmap()也是系统调用,如前所述,mmap()中没有进行数据拷贝,真正的数据拷贝是在缺页中断处理时进行的,由于mmap()将文件直接映射到用户空间,所以中断处理函数根据这个映射关系,直接将文件从硬盘拷贝到用户空间,只进行了 一次数据拷贝 。因此,内存映射的效率要比 read/write效率高。 lmdb使用mmap过程lmdb创建完env对象,打开时,会做data file和lock file的mmap映射:
其他时刻都直接使用内存指针,通过系统级别的缺页异常获取对应的数据。页面内数据的获取和使用 页面头部大小及内容是固定的,具体的含义代表根据flags决定,在头部之后紧接的是node,真正的key-value值对所在位置的索引,因此访问这些node时通过指针计算即可得到对应的位置。 lmdb 之后是如何将页面给映射进进程地址空间呢.lmdb通过 在lmdb中B+Tree的是基于append-only B+Tree改造的。对于数据增加、修改、删除导致页面增加时,pageno也增加,当旧页面(数据旧版本)被重用时,pageno 保持不变,因此pageno保持了在数据文件中的顺序性,从而在获取页面时,只需要进行简单计算即可以。同时在创建env对象时,数据库已经被整个映射进整个进程空间,因此系统在映射时,会给数据库文件保留全部地址空间,从而在根据上述算法获取真实数据库,系统触发缺页错误,进而从数据文件中获取整个页面内容。此为最简单有效方式,否则不将全部数据映射进地址空间,对于未映射部分还需要在访问页面时判断是否已经被映射,未被映射时进行映射。 在需要时在通过文件方式写入。lmdb保证任意时刻只有一个写操作在进行,从而避免了并发时数据被破坏。 B-tree/B+tree/B*treeB-treeB-tree又叫平衡多路查找树。一棵m阶的B-tree (m叉树)的特性如下:
B+treeB+ tree:是应文件系统所需而产生的一种B-tree的变形树。一棵m阶的B+-tree和m阶的B-tree的差异在于:
B*treeB* tree是B+ tree的变体,在B+-tree的非根和非叶子结点再增加指向兄弟的指针;B*-tree定义了非叶子结点关键字个数至少为(2/3)*M,即块的最低使用率为 2/3(代替B+树的1/2)。如下图所示
Morelmdb中的使用lmdb代码主要分为page管理和cursor操作两块实现b-tree结构.
COW and MVCCCOW(Copy-on-write)写入时复制(Copy-on-write,COW)是一种计算机程序设计领域的优化策略。其核心思想是,如果有多个调用者(callers)同时要求相同资源(如内存或磁盘上的数据存储),他们会共同获取相同的指针指向相同的资源,直到某个调用者试图修改资源的内容时,系统才会真正复制一份专用副本(private copy)给该调用者,而其他调用者所见到的最初的资源仍然保持不变。这过程对其他的调用者都是透明的(transparently)。此作法主要的优点是如果调用者没有修改该资源,就不会有副本(private copy)被创建,因此多个调用者只是读取操作时可以共享同一份资源。4 MVCC(Multiversion concurrency control)Multiversion concurrency control (MCC or MVCC), is a concurrency control method commonly used by database management systems to provide concurrent access to the database and in programming languages to implement transactional memory. If someone is reading from a database at the same time as someone else is writing to it, it is possible that the reader will see a half-written or inconsistent piece of data. There are several ways of solving this problem, known as concurrency control methods. The simplest way is to make all readers wait until the writer is done, which is known as a lock. This can be very slow, so MVCC takes a different approach: each user connected to the database sees a snapshot of the database at a particular instant in time. Any changes made by a writer will not be seen by other users of the database until the changes have been completed (or, in database terms: until the transaction has been committed.) When an MVCC database needs to update an item of data, it will not overwrite the old data with new data, but instead mark the old data as obsolete and add the newer version elsewhere. Thus there are multiple versions stored, but only one is the latest. This allows readers to access the data that was there when they began reading, even if it was modified or deleted part way through by someone else. It also allows the database to avoid the overhead of filling in holes in memory or disk structures but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database it also allows the system to optimize documents by writing entire documents onto contiguous sections of disk—when updated, the entire document can be re-written rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure. MVCC provides point in time consistent views. Read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. Read and write transactions are thus isolated from each other without any need for locking. Writes create a newer version, while concurrent reads access the older version.5 MVCC/COW在LMDB中的实现LMDB对MVCC加了一个限制,即只允许一个写线程存在,从根源上避免了写写冲突,当然代价就是写入的并发性能下降。因为只有一个写线程,所以不会不需要wal 日志、读写依赖队列、锁队列等一系列控制并发、事务回滚、数据恢复的基础工具。 MVCC的基础就是COW,对于不同的用户来说,若其在整个操作过程中不进行任何的数据改变,其就使用同一份数据即可,若需要进行改变,比如增加、删除、修改等,就需要在私有数据版本上进行,修改完成提交之后才给其他事务可见。 LMDB中,数据操作的基本单元是页,因此COW也是以页为单位,对应函数是 实际上通过以上两个函数实现了MVCC的核心,对于读写的控制,通过 MVCC的一个副作用就是对于存在大量写的应用,其数据版本很多,因此旧数据会占用大量空间,LMDB中通过freedb解决,即将不再使用的旧的数据页面空间插入到一棵B+Tree当中,这样旧空间在所有事务不再访问之后就可以被LMDB使用,从而避免了需要定期执行清理操作。当然其副作用是数据只能保持最新不能恢复到任意时刻, 事务控制事务的基本特征事务是恢复和并发控制的基本单位。它是一个操作序列,这些操作要么都执行,要么都不执行,它是一个不可分割的工作单位。事务是数据库维护数据一致性的单位,在每个事务结束时,都能保持数据一致性。 事务应该具有4个属性:原子性、一致性、隔离性、持久性。这四个属性通常称为ACID特性。
LMDB 中的实现基本思路
Example#include <stdio.h>#include 'lmdb.h'int main(int argc,char * argv[]){int rc;MDB_env *env;MDB_dbi dbi;MDB_val key, data;MDB_txn *txn;MDB_cursor *cursor;char sval[32];/* Note: Most error checking omitted for simplicity */rc = mdb_env_create(&env);rc = mdb_env_open(env, './testdb', 0, 0664);rc = mdb_txn_begin(env, NULL, 0, &txn);rc = mdb_dbi_open(txn, NULL, 0, &dbi);key.mv_size = sizeof(int);key.mv_data = sval;data.mv_size = sizeof(sval);data.mv_data = sval;sprintf(sval, '%03x %d foo bar', 32, 3141592);rc = mdb_put(txn, dbi, &key, &data, 0);rc = mdb_txn_commit(txn);if (rc) {fprintf(stderr, 'mdb_txn_commit: (%d) %s\n', rc, mdb_strerror(rc));goto leave;}rc = mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);rc = mdb_cursor_open(txn, dbi, &cursor);while ((rc = mdb_cursor_get(cursor, &key, &data, MDB_NEXT)) == 0) {printf('key: %p %.*s, data: %p %.*s\n',key.mv_data, (int) key.mv_size, (char *) key.mv_data,data.mv_data, (int) data.mv_size, (char *) data.mv_data);}mdb_cursor_close(cursor);mdb_txn_abort(txn);leave:mdb_dbi_close(env, dbi);mdb_env_close(env);return 0;} Run the example:
LMDB 核心数据结构
struct MDB_env {HANDLE me_fd; /**< The main data file */HANDLE me_lfd; /**< The lock file */HANDLE me_mfd; /**< just for writing the meta pages */uint32_t me_flags; /**< @ref mdb_env */unsigned int me_psize; /**< DB page size, inited from me_os_psize */unsigned int me_os_psize; /**< OS page size, from #GET_PAGESIZE */unsigned int me_maxreaders; /**< size of the reader table *//** Max #MDB_txninfo.%mti_numreaders of interest to #mdb_env_close() */volatile int me_close_readers;MDB_dbi me_numdbs; /**< number of DBs opened */MDB_dbi me_maxdbs; /**< size of the DB table */MDB_PID_T me_pid; /**< process ID of this env */char *me_path; /**< path to the DB files */char *me_map; /**< the memory map of the data file */MDB_txninfo *me_txns; /**< the memory map of the lock file or NULL */MDB_meta *me_metas[NUM_METAS]; /**< pointers to the two meta pages */void *me_pbuf; /**< scratch area for DUPSORT put() */MDB_txn *me_txn; /**< current write transaction */MDB_txn *me_txn0; /**< prealloc'd write transaction */mdb_size_t me_mapsize; /**< size of the data memory map */off_t me_size; /**< current file size */pgno_t me_maxpg; /**< me_mapsize / me_psize */MDB_dbx *me_dbxs; /**< array of static DB info */uint16_t *me_dbflags; /**< array of flags from MDB_db.md_flags */unsigned int *me_dbiseqs; /**< array of dbi sequence numbers */pthread_key_t me_txkey; /**< thread-key for readers */txnid_t me_pgoldest; /**< ID of oldest reader last time we looked */MDB_pgstate me_pgstate; /**< state of old pages from freeDB */# define me_pglast me_pgstate.mf_pglast# define me_pghead me_pgstate.mf_pgheadMDB_page *me_dpages; /**< list of malloc'd blocks for re-use *//** IDL of pages that became unused in a write txn */MDB_IDL me_free_pgs;/** ID2L of pages written during a write txn. Length MDB_IDL_UM_SIZE. */MDB_ID2L me_dirty_list;/** Max number of freelist items that can fit in a single overflow page */int me_maxfree_1pg;/** Max size of a node on a page */unsigned int me_nodemax;#if !(MDB_MAXKEYSIZE)unsigned int me_maxkey; /**< max size of a key */#endifint me_live_reader; /**< have liveness lock in reader table */# define me_rmutex me_txns->mti_rmutex /**< Shared reader lock */# define me_wmutex me_txns->mti_wmutex /**< Shared writer lock */void *me_userctx; /**< User-settable context */MDB_assert_func *me_assert_func; /**< Callback for assertion failures */};
/** Meta page content. * A meta page is the start point for accessing a database snapshot. * Pages 0-1 are meta pages. Transaction N writes meta page #(N % 2). */typedef struct MDB_meta {/** Stamp identifying this as an LMDB file. It must be set * to #MDB_MAGIC. */uint32_t mm_magic;/** Version number of this file. Must be set to #MDB_DATA_VERSION. */uint32_t mm_version;void *mm_address; /**< address for fixed mapping */pgno_t mm_mapsize; /**< size of mmap region */MDB_db mm_dbs[CORE_DBS]; /**< first is free space, 2nd is main db *//** The size of pages used in this DB */#define mm_psize mm_dbs[FREE_DBI].md_pad/** Any persistent environment flags. @ref mdb_env */#define mm_flags mm_dbs[FREE_DBI].md_flagspgno_t mm_last_pg; /**< last used page in file */volatile txnid_t mm_txnid; /**< txnid that committed this page */} MDB_meta;
typedef struct MDB_node {/** lo and hi are used for data size on leaf nodes and for * child pgno on branch nodes. On 64 bit platforms, flags * is also used for pgno. (Branch nodes have no flags). * They are in host byte order in case that lets some * accesses be optimized into a 32-bit word access. */unsigned short mn_lo, mn_hi; /**< part of data size or pgno */unsigned short mn_flags; /**< @ref mdb_node */unsigned short mn_ksize; /**< key size */char mn_data[1]; /**< key and data are appended here */} MDB_node;
struct MDB_txn {MDB_txn *mt_parent; /**< parent of a nested txn *//** Nested txn under this txn, set together with flag #MDB_TXN_HAS_CHILD */MDB_txn *mt_child;pgno_t mt_next_pgno; /**< next unallocated page */txnid_t mt_txnid;MDB_env *mt_env; /**< the DB environment *//** The list of pages that became unused during this transaction. */MDB_IDL mt_free_pgs;/** The list of loose pages that became unused and may be reused * in this transaction, linked through #NEXT_LOOSE_PAGE(page). */MDB_page *mt_loose_pgs;/* #Number of loose pages (#mt_loose_pgs) */int mt_loose_count;/** The sorted list of dirty pages we temporarily wrote to disk * because the dirty list was full. page numbers in here are * shifted left by 1, deleted slots have the LSB set. */MDB_IDL mt_spill_pgs;union {/** For write txns: Modified pages. Sorted when not MDB_WRITEMAP. */MDB_ID2L dirty_list;/** For read txns: This thread/txn's reader table slot, or NULL. */MDB_reader *reader;} mt_u;/** Array of records for each DB known in the environment. */MDB_dbx *mt_dbxs;/** Array of MDB_db records for each known DB */MDB_db *mt_dbs;/** Array of sequence numbers for each DB handle */unsigned int *mt_dbiseqs;/** In write txns, array of cursors for each DB */MDB_cursor **mt_cursors;/** Array of flags for each DB */unsigned char *mt_dbflags;/** Number of DB records in use, or 0 when the txn is finished. * This number only ever increments until the txn finishes; we * don't decrement it when individual DB handles are closed. */MDB_dbi mt_numdbs;unsigned int mt_flags; /**< @ref mdb_txn *//** #dirty_list room: Array size - \#dirty pages visible to this txn. * Includes ancestor txns' dirty pages not hidden by other txns' * dirty/spilled pages. Thus commit(nested txn) has room to merge * dirty_list into mt_parent after freeing hidden mt_parent pages. */unsigned int mt_dirty_room;};
|
|