Python: garbled attachment names when sending files with Chinese names over SMTP

Posted by [Kohn] on Friday, January 7, 2022
Last Modified on Monday, May 8, 2023

I recently had to build a feature that sends attachments to a mailbox over SMTP. I copied some code from the web:

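from email.mime.application import MIMEApplication
# content holds the attachment bytes; atta["Name"] is the attachment file name
atta_obj = MIMEApplication(content)
atta_obj.add_header('Content-Disposition', 'attachment', filename=atta["Name"])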

Attachments with English file names were sent correctly, but when the file name was Chinese, the received attachment was renamed to AT00001.bin.

Inspecting the SMTP message, I found that for an English attachment name the header is:

Content-Disposition: attachment; filename="hehe.txt"

For a Chinese attachment name, the header becomes:

Content-Disposition: =?utf-8?b?YXR0YWNobWVudDsgZmlsZW5hbWU9IuWWneWWnS50eHQi?=

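Clearly the key "attachment; filename=" text is gone, replaced by a string of odd characters. The whole header value, disposition type and filename parameter together, has been wrapped into a single RFC 2047 encoded-word (UTF-8, Base64), so the receiver no longer sees a separate filename parameter, which is presumably why the mail client falls back to a generic name.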
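
As a quick check, the encoded value can be unpacked with decode_header from the standard library's email.header module. This is a minimal sketch assuming Python 3; the header value is copied verbatim from the captured message above, and the variable names are only illustrative.

```python
from email.header import decode_header

# Header value copied from the captured SMTP message above.
raw_value = "=?utf-8?b?YXR0YWNobWVudDsgZmlsZW5hbWU9IuWWneWWnS50eHQi?="

# decode_header returns a list of (data, charset) pairs, one per chunk.
chunks = decode_header(raw_value)
decoded = "".join(
    data.decode(charset or "ascii") if isinstance(data, bytes) else data
    for data, charset in chunks
)
print(decoded)
```

The decoded text is the complete attachment; filename="..." string, so the file name was not dropped. It was encoded together with the disposition type as one opaque word.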

Looking at the source of add_header:

    def add_header(self, _name, _value, **_params):
        """Extended header setting.

        name is the header field to add.  keyword arguments can be used to set
        additional parameters for the header field, with underscores converted
        to dashes.  Normally the parameter will be added as key="value" unless
        value is None, in which case only the key will be added.  If a
        parameter value contains non-ASCII characters it must be specified as a
        three-tuple of (charset, language, value), in which case it will be
        encoded according to RFC2231 rules.

        Example:

        msg.add_header('content-disposition', 'attachment', filename='bud.gif')
        """
        parts = []
        for k, v in _params.items():
            if v is None:
                parts.append(k.replace('_', '-'))
            else:
                parts.append(_formatparam(k.replace('_', '-'), v))
        if _value is not None:
            parts.insert(0, _value)
        self._headers.append((_name, SEMISPACE.join(parts)))

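As the source shows, add_header does not simply store the text in the header collection. It does two more things. First, it replaces '_' with '-' in parameter names, which is unrelated to our problem. Second, it calls _formatparam to encode each parameter. The lost file name looks like an encoding problem at this step.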

def _formatparam(param, value=None, quote=True):
    """Convenience function to format and return a key=value pair.

    This will quote the value if needed or if quote is true.  If value is a
    three tuple (charset, language, value), it will be encoded according
    to RFC2231 rules.
    """
    if value is not None and len(value) > 0:
        # A tuple is used for RFC 2231 encoded parameter values where items
        # are (charset, language, value).  charset is a string, not a Charset
        # instance.
        if isinstance(value, tuple):
            # Encode as per RFC 2231
            param += '*'
            value = utils.encode_rfc2231(value[2], value[0], value[1])
        # BAW: Please check this.  I think that if quote is set it should
        # force quoting even if not necessary.
        if quote or tspecials.search(value):
            return '%s="%s"' % (param, utils.quote(value))
        else:
            return '%s=%s' % (param, value)
    else:
        return param

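So when a parameter value contains non-ASCII characters, add_header requires it to be passed as a 3-tuple, which _formatparam then encodes according to RFC 2231. Our original code did not pass it that way.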
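
To see what the RFC 2231 form looks like, the encode_rfc2231 helper called inside _formatparam can be invoked directly from email.utils. This is a small sketch assuming Python 3, where the value is passed as a str; the sample file name is only an illustration.

```python
from email.utils import encode_rfc2231

# Arguments are (value, charset, language); an empty language is allowed.
encoded = encode_rfc2231("喝喝.txt", "utf-8", "")
print(encoded)  # utf-8''%E5%96%9D%E5%96%9D.txt
```

The result has the shape charset'language'percent-encoded-value. _formatparam attaches it to the parameter name with a trailing '*', giving a filename* parameter that contains only ASCII.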

In the end I changed the code to:

atta_obj.add_header("Content-Disposition","attachment", filename=("utf-8", "", basename(atta["Name"]).encode("utf-8")))

After that, the problem was solved.
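
For reference, here is a minimal end-to-end sketch of the same fix on Python 3. It rests on some assumptions: the quoted source appears to come from an older version of the email package, and on Python 3 the value in the tuple should, as far as I can tell, be a str rather than UTF-8 bytes, so the .encode("utf-8") call is left out. The helper name build_attachment, the sample path, and the payload are hypothetical.

```python
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from os.path import basename


def build_attachment(path, content):
    """Wrap raw bytes as an attachment with an RFC 2231 encoded file name."""
    part = MIMEApplication(content)
    # (charset, language, value): the tuple form triggers RFC 2231 encoding.
    part.add_header(
        "Content-Disposition",
        "attachment",
        filename=("utf-8", "", basename(path)),
    )
    return part


msg = MIMEMultipart()
part = build_attachment("/tmp/喝喝.txt", b"hello")
msg.attach(part)

print(part["Content-Disposition"])  # ASCII-only header with a filename* parameter
print(part.get_filename())          # the original Unicode file name, decoded back
```

The header now keeps "attachment" in plain text and carries the name in a separate filename* parameter, which is the structure a mail client needs in order to recover the name.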

The corresponding question on Stack Overflow: https://stackoverflow.com/questions/34668240/email-an-attachment-with-non-ascii-filename-with-python-email