如何在有限的内存限制下实现数十亿级手机号码去重

美国国美萝卜 2022-01-12

展开全文

版权申明

原创文章：本博所有原创文章，欢迎转载，转载请注明出处，并联系本人取得授权。
版权邮箱地址：banquan@mrdwy.com

问题难点

文本和数据的去重是经常要用到的重要操作，普通数量的文本处理并不存在技术上的难点，可以直接在内存中高效处理，但是如果涉及到的文本量达到了数十亿级别，直接在内存中处理文本去重工作几乎变成不可实现，例如假设有个文本中包含有20亿手机号码，每个手机号码共计11位数字，int最大值只能保存2^31-1=2147483647，不够表示手机号，因此只能用String或者long型来表示手机号。

Java基本数据类型

byte         1字节
short        2字节
int            4字节
long         8字节
char         2字节（C语言中是1字节）可以存储一个汉字
float         4字节
double     8字节
boolean   false/true(理论上占用1bit,1/8字节，实际处理按1byte处理)
那么用long类型一个手机号占用8字节，用String类型，则需要char[11]的字符数组来表示，占用2 * 11 = 22字节
20亿则分别需要
long        16,000,000,000字节 = 14.9G
String      44,000,000,000字节 = 40.98G
如果去重操作直接将手机号加载到内存中处理显然是十分浪费系统内存的方式，况且需要如此大内存的机器专门运行去重操作，也非常不划算，就算有这样大内存的机器供你使用，直接从几十亿文本中判断字符串有没有重复的计算量也是非常耗时的一个操作，重复进行几十亿次，显然这个时间量是无法接受的。

解决思路

数据结构

大文本去重最优的数据结构应该是使用Bitmap，那么具体要如何实现呢？
简单来说就是直接使用二进制来表示手机号是否存在，一个byte类型占用1字节，用byte类型最多可以存储8bit数据，也就是8个二进制位，分别能够表示8个数据是否存在，用0表示不存在1表示存在，那么理论上，我们使用1个字符的内存容量就可以表示8个手机号是否存在了。
中国的手机号全部是1开头，目前启用了18、17、15、14、13几个号码段，甚至可以细分到3位，但为了简便，以及考虑到未来号码扩展，我们简略的直接只忽略1的开头，用Bitmap来表示全部号码是否存在只需要占用9,999,999,999个bit位也就是1,249,999,999.875 约等于 1,250,000,000 个字节 = 1.164g的空间！
比起上一节的空间占用量是不是一个非常惊人的节约呢？

如何计算

我们可以直接申请一个1.165G存储空间当作手机的字典表，且所有位全部用0表示目前没有任何号码在我们的字典中。
字典的第0位表示手机号码100 0000 0000 最后一位表示199 9999 9999，如果文本中存在号码100 0000 0000则将第一位改成1即可表示100 0000 0000号码存在，依次类推。

运算

因为Java中最小的数据类型为byte，占用1字节，因此用二进制表示为0000 0000，每一位分别有数据的表示我们可以定义一个数组如下：
0000 0001
0000 0010
0000 0100
0000 1000
0001 0000
0010 0000
0100 0000
1000 0000
为了方便我们将其转化成10进制则分别为
1, 2，4，8，16，32，64，-128（字节第一位为符号位）
那么每个字符，我们都可以表示为8个上述的二进制数

新增手机号

我们现在需要新增一个133 3333 3333 的手机号如何运算呢？
首先去掉最头1，则剩下33 3333 3333 将其除以 8 = 416,666,666 （取整）也就是该手机号在整个字典表的第416,666,666个字符，然后将33 3333 3333 除以8 取模 = 5 表示该手机号在 416,666,666 的第5位，那么用二进制表示这个字符表示应该是0010 0000，运算用或运算 0000 0000 | 0010 0000 = 0010 0000
复杂情况如果该字符还有其他号码存在例如原始字符为 0100 0000 | 0010 0000 = 0110 0000
证明用或运算我们也可以完美的完成新增手机号的操作。

删除手机号

我们再次用133 3333 3333 号码举例，刚才我们已经知道了该手机号在字典表的第416,666,666个字符的第5位，如何删除呢：
1、求补 0010 0000 ^ 1111 1111 = 1101 1111
2、与操作 0010 0000 & 1101 1111 = 0000 0000
如果是其他号码的结果 0110 0011 & 1101 1111 = 0100 0011
同样可以完美的删除该字符的第5位

判断手机号是否存在

还是用刚才的例子手机号133 3333 3333 在字典表的第416,666,666个字符的第5位，我们要判断这个位置是否为1
这里有两种算法都可以实现：
1、判断 0010 0000 == 原始数据 & 0010 0000 如果等式成立，则表示该位置有号码，否则没有
演示
0000 0000 & 0010 0000 = 0000 0000 != 0010 0000 则没有该号码
0010 0000 & 0010 0000 = 0010 0000 == 0010 0000 则有该号码
该字符有其他号码的情况
1000 0000 & 0010 0000 = 0000 0000 != 0010 0000 该位置没有该号码
1010 0000 & 0010 0000 = 0010 0000 == 0010 0000 该位置有该号码

2、判断 1111 1111 == 原始数据 | (0010 0000 ^ 1111 1111)
0000 0000 | 1101 1111 = 1101 1111 != 1111 1111 没有号码
0010 0000 | 1101 1111 = 1111 1111 == 1111 1111 有号码
有其他号码
1000 0000 | 1101 1111 = 1101 1111 != 1111 1111 没有号码
1010 0000 | 1101 1111 = 1111 1111 == 1111 1111 有号码
以上我们完美的证明了两种算法都可以表示是否存在某个手机号

算法实现代码

package org.tcrow.datastructure;

import com.google.common.io.FileWriteMode;
import com.google.common.io.Files;

import javax.annotation.concurrent.ThreadSafe;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.regex.Pattern;

/**
 * @author tcrow.luo
 * @date 2018/8/27
 * @description BitMap处理手机号去重，支持海量手机号数据去重，处理时间毫秒级，理论上经过改造可以支持更大的整数去重运算，但是初始化需要占用更多的存储空间
 */
@ThreadSafe
public class Mobile {

    private final static int INIT_BUFFER_SIZE = 1024 * 1024;
    /**
     * 正则表达式：验证手机号
     */
    private final static String REGEX_MOBILE = '^((10[0-9])|(11[0-9])|(12[0-9])|(13[0-9])|(14[0-9])|(15[0-9])|(16[0-9])|(17[0-9])|(18[0-9])|(19[0-9]))\\d{8}$';

    /**
     * 二进制1~8位分别为1的值，与原值进行或操作即可完成在号码库的新增操作
     */
    private final static byte[] ARRAY_BYTE = {0b00000001, 0b00000010, 0b00000100, 0b00001000, 0b00010000, 0b00100000, 0b01000000, -0b10000000};


    /**
     * 二进制掩码，-1 用二进制表示 为 11111111
     * 因此任何字节异或掩码后可以获得取反值，例如 00000001 ^ 11111111 = 11111110
     */
    private final static byte MASK_BYTE = -1;

    /**
     * 用于存储手机号码是否存在
     * 因为中国手机号码都是1开头，所以第一位省略
     * 我们需要表示最大9999999999个号码是否存在
     * 1字节 = 8 bit 最多可以表示8个号码
     * 因此需要空间 9999999999 / 8 = 1249999999.875 约等于 125 * 10 ^ 7 字节 约为 1.165 G 空间
     * 直接加载到内存中比较浪费，因此可以创建一个二进制文件直接表示，然后通过RandomAccessFile类读文件相应的位
     */
    private File dictFile;

    private RandomAccessFile file;

    /**
     * 读写全局锁，保证 读读共享, 读写互斥, 写写互斥
     */
    private final static ReadWriteLock LOCK = new ReentrantReadWriteLock();

    public Mobile(final String filePath) throws FileNotFoundException {
        dictFile = new File(filePath);
        if (!dictFile.exists()) {
            try {
                init();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        file = new RandomAccessFile(dictFile, 'rw');
    }

    private void init() throws IOException {
        LOCK.writeLock().lock();
        try {
            Files.createParentDirs(dictFile);
            int loop = 1250000000 / INIT_BUFFER_SIZE   1;
            byte[] buffer = new byte[INIT_BUFFER_SIZE];
            for (int i = 0; i < loop; i  ) {
                Files.asByteSink(dictFile, FileWriteMode.APPEND).write(buffer);
            }
        } finally {
            LOCK.writeLock().unlock();
        }
    }

    /**
     * 新增电话号码到字典中
     *
     * @param mobile
     */
    public void insert(String mobile) throws IOException {
        if (!isMobile(mobile)) {
            throwException(mobile);
        }
        if (hasMobile(mobile)) {
            return;
        }
        long no = Long.parseLong(mobile) - 10000000000L;
        int byteNum = (int) (no / 8);
        int bit = (int) (no % 8);
        byte b = read(byteNum);
        //对应位或操作 e.g. 11100000 | 00000001 == 11100001 则成功插入了标志位
        b = (byte) (b | ARRAY_BYTE[bit]);
        write(byteNum, b);
    }

    /**
     * 从字典中删除电话号码
     * @param mobile
     * @throws IOException
     */
    public void delete(String mobile) throws IOException {
        if (!isMobile(mobile)) {
            throwException(mobile);
        }
        if (!hasMobile(mobile)) {
            return;
        }
        long no = Long.parseLong(mobile) - 10000000000L;
        int byteNum = (int) (no / 8);
        int bit = (int) (no % 8);
        byte b = read(byteNum);
        //字节与操作 e.g. 00010001 & 11111110 == 00010000 则成功删除了标志位，对应标志位异或掩码得对应的补码 e.g. 00000001 ^ 11111111 == 11111110
        b = (byte) (b & (ARRAY_BYTE[bit] ^ MASK_BYTE));
        write(byteNum, b);
    }

    private void throwException(String mobile) {
        throw new RuntimeException('The string \''   mobile   '\' is not the mobile number.');
    }

    private void throwUnknownException() {
        throw new RuntimeException('read data unknown exception');
    }

    public boolean hasMobile(String mobile) throws IOException {
        if (!isMobile(mobile)) {
            throwException(mobile);
        }
        long no = Long.parseLong(mobile) - 10000000000L;
        int byteNum = (int) (no / 8);
        int bit = (int) (no % 8);
        byte b = read(byteNum);
        // 01000111 & 00000010 == 00000010   01000101 & 00000010 == 0
        // 10111111 | 11111101 == 11111111 == -1  10111101 | 11111101 == 11111101 != -1
        // 两种判断方式都可以实现，取简单的方式直接按位与操作后等于判断位则表示标志位有值
        if (ARRAY_BYTE[bit] == (byte) (b & ARRAY_BYTE[bit])) {
            return true;
        } else {
            return false;
        }
    }

    private boolean isMobile(String mobile) {
        return Pattern.matches(REGEX_MOBILE, mobile);
    }

    private byte read(int byteNum) throws IOException {
        LOCK.readLock().lock();
        byte[] buffer = new byte[1];
        try {
            file.seek(byteNum);
            int read = file.read(buffer);
            if (read <= 0) {
                throwUnknownException();
            }
        } finally {
            LOCK.readLock().unlock();
        }
        return buffer[0];
    }

    private void write(int byteNum, byte b) throws IOException {
        LOCK.writeLock().lock();
        try {
            file.seek(byteNum);
            byte[] buffer = new byte[1];
            buffer[0] = b;
            file.write(buffer);
        } finally {
            LOCK.writeLock().unlock();
        }
    }

}

我这边为了节约内存，字典是直接在外部存储申请的空间，操作上会比内存中耗费稍微多一点时间，因为要读取外部存储的数据，但相应的可以对计算机内存的需求度降低到最少，有了这个算法，我们要实现海量手机号的去重就变得轻而易举了。

延伸阅读

bit-map主要用于对存储空间大量压缩的情况，例如其他字符文本的去重也是可以使用的，例如要进行邮箱地址去重，可以将邮箱地址进行hash运算得到等长的字符串，然后转换成ascii码，再存储到bitmap中，就可以进行邮箱地址的去重了。

再进一步我们实际的场景中可能不止要对文本进行去重，还得进行热点扫描，例如百度的搜索热点，这个可以使用到Trie树的数据结构，可以最快速的查找并获得字典出现频率，我这里贴上一个简单的Trie树的实现代码。

package org.tcrow.datastructure;

import javax.annotation.concurrent.NotThreadSafe;
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

/**
 * Trie树
 *
 * @author tcrow.luo
 */
@NotThreadSafe
public class Trie {

    /**
     * 树深度
     */
    private int depth = 0;

    /**
     * 节点总数
     */
    private int nodeNum = 0;

    /**
     * 子节点数，这里设置为支持全部ASCII码178个字符，如果确保输入字符串只有字母可改为26个字符
     */
    private final int SIZE = 178;

    private TrieNode root;

    /**
     * 构造Trie树初始化数据
     */
    public Trie() {
        this.nodeNum = 0;
        this.depth = 0;
        this.root = new TrieNode();
    }


    /**
     * 子类节点
     */
    private class TrieNode {
        private int passed;
        private int ended;
        private boolean isEnd;
        private char value;
        private TrieNode parent;
        private TrieNode[] children;

        TrieNode() {
            this.passed = 0;
            this.ended = 0;
            this.children = new TrieNode[SIZE];
        }

        TrieNode(TrieNode parent) {
            this.passed = 0;
            this.ended = 0;
            this.parent = parent;
            this.children = new TrieNode[SIZE];
        }

    }


    /**
     * 插入字符串
     *
     * @param str
     * @return
     */
    public boolean insertStr(String str) {
        char[] strArr = str.toCharArray();
        TrieNode current = this.root;
        for (char c : strArr) {
            if (null == current.children[c]) {
                current.children[c] = new TrieNode(current);
                current.children[c].value = c;
                current.children[c].passed = 1;
                this.nodeNum  ;
            } else {
                current.children[c].passed  ;
            }
            current = current.children[c];
        }
        current.isEnd = true;
        current.ended  ;
        if (strArr.length > this.depth) {
            this.depth = strArr.length;
        }

        return true;
    }

    /**
     * 统计字符串出现次数
     *
     * @param str
     * @return
     */
    public int countPrefix(String str) {
        TrieNode current = this.root;
        char[] strArr = str.toCharArray();
        for (char c : strArr) {
            if (null == current.children[c]) {
                return 0;
            } else {
                current = current.children[c];
            }

        }
        return current.ended;
    }

    /**
     * 统计出现次数最多的字符串
     *
     * @param rank 名次数
     * @return
     */
    public String[] tops(int rank) {
        TrieNode current = this.root;
        LinkedList<Integer> topTimes = new LinkedList<>();
        LinkedList<String> result = new LinkedList<>();
        ergodic(topTimes, result, current, rank);
        List<String> retList = new ArrayList<>();
        int length = rank > result.size() ? result.size() : rank;
        for (int i = 0; i < length; i  ) {
            retList.add('单词：'   result.get(i)   '，词频：'   topTimes.get(i));
        }
        return retList.toArray(new String[rank]);
    }


    /**
     * TOP统计递归
     *
     * @param topTimes 词频次数链表
     * @param result   字符串链表
     * @param current  当前节点
     * @param rank     统计项目数
     */
    private void ergodic(LinkedList<Integer> topTimes, LinkedList<String> result, TrieNode current, int rank) {
        TrieNode[] children = current.children;
        for (TrieNode child : children) {
            if (null != child) {
                if (child.ended > 0) {
                    if (topTimes.size() == 0) {
                        topTimes.add(child.ended);
                        result.add(getStr(child));
                    } else {
                        for (int i = 0; i < topTimes.size(); i  ) {
                            if (topTimes.get(i).intValue() > child.ended) {
                                continue;
                            }
                            topTimes.add(i, child.ended);
                            result.add(i, getStr(child));
                            if (topTimes.size() > rank) {
                                topTimes.removeLast();
                                result.removeLast();
                            }
                            break;
                        }
                    }
                }
                if (child.passed > 0) {
                    ergodic(topTimes, result, child, rank);
                }
            }
        }

    }

    /**
     * 获取节点字符串
     *
     * @param current
     * @return
     */
    private String getStr(TrieNode current) {
        String str = new String(new char[]{current.value});
        if (this.root.equals(current.parent)) {
            return str;
        } else {
            return getStr(current.parent)   str;
        }
    }

    /**
     * 遍历节点
     *
     * @param current
     */
    private void ergodic(TrieNode current) {
        TrieNode[] children = current.children;
        String word;
        for (TrieNode child : children) {
            if (null != child) {
                if (child.ended > 0) {
                    word = getStr(child);
                    System.out.println('单词:'   word   ',词频:'   countPrefix(word));
                }
                if (child.passed > 0) {
                    ergodic(child);
                }
            }
        }
    }

    /**
     * 遍历整个树，打印出现的字符串
     */
    public void printAllStr() {
        TrieNode current = this.root;
        ergodic(current);
    }

}