
File encoding: Byte order mark

 戴维图书馆 2018-08-08

https://en./wiki/Byte_order_mark

Byte order mark

From Wikipedia, the free encyclopedia

The byte order mark (BOM) is a Unicode character, U+FEFF BYTE ORDER MARK, whose appearance as a magic number at the start of a text stream can signal several things to a program consuming the text:[1]

  • The byte order, or endianness, of the text stream;
  • The fact that the text stream's encoding is Unicode, to a high level of confidence;
  • Which Unicode encoding the text stream is encoded as.

BOM use is optional. Its presence interferes with the use of UTF-8 by software that does not expect non-ASCII bytes at the start of a file but that could otherwise handle the text stream.

Unicode can be encoded in units of 8-bit, 16-bit, or 32-bit integers. For the 16- and 32-bit representations, a computer receiving text from arbitrary sources needs to know which byte order the integers are encoded in. The BOM is encoded in the same scheme as the rest of the document and becomes an invalid Unicode code point if its bytes are swapped. Hence, the consumer of the text can examine these first few bytes to determine the endianness, without requiring some contract or metadata outside of the text stream itself. Generally the receiving computer will swap the bytes to its own endianness, if necessary, and will no longer need the BOM for processing.
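The byte-swap check described above can be sketched for the 16-bit case. This is a minimal illustration, not a reference implementation; the function names `bom_bytes` and `detect_utf16_order` are hypothetical and chosen only for this example.

```python
import struct

BOM = 0xFEFF  # code point of the byte order mark


def bom_bytes(byte_order: str) -> bytes:
    """Encode U+FEFF as one 16-bit unit; byte_order is '>' (big) or '<' (little)."""
    return struct.pack(byte_order + "H", BOM)


def detect_utf16_order(data: bytes) -> str:
    """Classify a stream by its first two bytes: 'big', 'little', or 'unknown'."""
    if len(data) < 2:
        return "unknown"
    # Read the first code unit as if the stream were big-endian.
    (as_big,) = struct.unpack(">H", data[:2])
    if as_big == 0xFEFF:
        return "big"
    if as_big == 0xFFFE:
        # The swapped value is not a valid character, so the bytes
        # must have been written in little-endian order.
        return "little"
    return "unknown"


print(bom_bytes(">").hex(" "), detect_utf16_order(bom_bytes(">")))
print(bom_bytes("<").hex(" "), detect_utf16_order(bom_bytes("<")))
```

Reading the first unit in one fixed order and comparing against both 0xFEFF and 0xFFFE avoids any dependence on the byte order of the machine running the check.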

The byte sequence of the BOM differs per Unicode encoding (including ones outside the Unicode standard such as UTF-7, see table below), and none of the sequences is likely to appear at the start of text streams stored in other encodings. Therefore, placing an encoded BOM at the start of a text stream can indicate that the text is Unicode and identify the encoding scheme used. This use of the BOM character is called a "Unicode signature".[2]

Usage

If the BOM character appears in the middle of a data stream, Unicode says it should be interpreted as a "zero-width non-breaking space" (inhibits line-breaking between word-glyphs). In Unicode 3.2, this usage is deprecated in favor of the "Word Joiner" character, U+2060.[1] This allows U+FEFF to be only used as a BOM.

UTF-8

The UTF-8 representation of the BOM is the (hexadecimal) byte sequence 0xEF,0xBB,0xBF.

The Unicode Standard permits the BOM in UTF-8,[3] but does not require or recommend its use.[4] Byte order has no meaning in UTF-8,[5] so its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8, or that it was converted to UTF-8 from a stream that contained an optional BOM. The standard also does not recommend removing a BOM when it is there, so that round-tripping between encodings does not lose information, and so that code that relies on it continues to work.[6][7] The IETF recommends that if a protocol either (a) always uses UTF-8, or (b) has some other way to indicate what encoding is being used, then it "SHOULD forbid use of U+FEFF as a signature."[8]
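As an illustration of the two policies (keeping the signature or dropping it), here is a small sketch using Python's standard codecs. The `utf-8` codec leaves a leading U+FEFF in the decoded text, while `utf-8-sig` removes one if present. The helper names are hypothetical.

```python
import codecs


def has_utf8_signature(data: bytes) -> bool:
    """True if the data starts with the UTF-8 encoding of U+FEFF (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def decode_keep_bom(data: bytes) -> str:
    """Decode as UTF-8; a leading BOM survives as the character U+FEFF."""
    return data.decode("utf-8")


def decode_drop_bom(data: bytes) -> str:
    """Decode as UTF-8, dropping one leading BOM if there is one."""
    return data.decode("utf-8-sig")


sample = codecs.BOM_UTF8 + "text".encode("utf-8")
print(has_utf8_signature(sample))     # True
print(repr(decode_keep_bom(sample)))  # '\ufefftext'
print(repr(decode_drop_bom(sample)))  # 'text'
```

Which of the two decoders is appropriate depends on whether downstream code needs to round-trip the signature.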

Not using a BOM allows text to be backwards-compatible with some software that is not Unicode-aware. Examples include programming languages that permit non-ASCII bytes in string literals but not at the start of the file.

Heuristic analysis can ascertain with high confidence whether UTF-8 is in use without the BOM due to the large number of byte sequences that are invalid in UTF-8.
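One simple form of this heuristic is to attempt a strict decode and treat any failure as evidence against UTF-8. The sketch below assumes the whole input is available in memory; `looks_like_utf8` is a hypothetical name.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if every byte sequence in data is well-formed UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


print(looks_like_utf8("naïve".encode("utf-8")))    # True
print(looks_like_utf8("naïve".encode("latin-1")))  # False
```

Pure ASCII input also passes this test, since ASCII is a subset of UTF-8, and a sample cut off in the middle of a multi-byte sequence will fail it.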

Microsoft compilers[9] and interpreters, and many pieces of software on Microsoft Windows such as Notepad treat the BOM as a required magic number rather than use heuristics. These tools add a BOM when saving text as UTF-8, and cannot interpret UTF-8 unless the BOM is present or the file contains only ASCII. Google Docs also adds a BOM when converting a document to a plain text file for download.

UTF-16

In UTF-16, a BOM (U+FEFF) may be placed as the first character of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units of the file or stream. If an attempt is made to read this stream with the wrong endianness, the bytes will be swapped, thus delivering the character U+FFFE, which is defined by Unicode as a "noncharacter" that should never appear in the text.

  • If the 16-bit units are represented in big-endian byte order, the BOM will appear in the sequence of bytes as 0xFE 0xFF
  • If the 16-bit units use little-endian order, the BOM will appear in the sequence of bytes as 0xFF 0xFE

Neither of these sequences is valid UTF-8, so their presence indicates that the file is not encoded in UTF-8.

For the IANA registered charsets UTF-16BE and UTF-16LE, a byte order mark should not be used because the names of these character sets already determine the byte order. If encountered anywhere in such a text stream, U+FEFF is to be interpreted as a "zero width no-break space".

If there is no BOM, it is possible to guess whether the text is UTF-16 and its byte order by searching for ASCII characters (i.e. a 0 byte adjacent to a byte in the 0x20-0x7E range, also 0x0A and 0x0D for CR and LF). A large number (i.e. far higher than random chance) in the same order is a very good indication of UTF-16 and whether the 0 is in the even or odd bytes indicates the byte order. However, this can result in both false positives and false negatives.
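A rough sketch of this guess follows. The 0.5 threshold and the name `guess_utf16_order` are assumptions made for illustration; a real detector would tune the threshold and combine this with other evidence.

```python
# Bytes treated as "ASCII text": printable characters plus LF and CR.
ASCII_TEXT = set(range(0x20, 0x7F)) | {0x0A, 0x0D}


def guess_utf16_order(data: bytes, threshold: float = 0.5):
    """Guess 'utf-16-be' or 'utf-16-le' from zero-byte placement, else None."""
    pairs = len(data) // 2
    if pairs == 0:
        return None
    be_hits = le_hits = 0
    for i in range(0, pairs * 2, 2):
        first, second = data[i], data[i + 1]
        if first == 0 and second in ASCII_TEXT:
            be_hits += 1  # zero in the even position suggests big-endian
        elif second == 0 and first in ASCII_TEXT:
            le_hits += 1  # zero in the odd position suggests little-endian
    if be_hits > le_hits and be_hits / pairs >= threshold:
        return "utf-16-be"
    if le_hits > be_hits and le_hits / pairs >= threshold:
        return "utf-16-le"
    return None


print(guess_utf16_order("Hello, world\r\n".encode("utf-16-le")))
print(guess_utf16_order("Hello, world\r\n".encode("utf-16-be")))
print(guess_utf16_order(b"plain single-byte text"))
```

Text with few ASCII characters (for example, mostly CJK) will fall below the threshold and return None, which is one source of the false negatives mentioned above.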

Clause D98 in the conformance chapter (section 3.10) of the Unicode Standard states, "The UTF-16 encoding scheme may or may not begin with a BOM. However, when there is no BOM, and in the absence of a higher-level protocol, the byte order of the UTF-16 encoding scheme is big-endian." Whether or not a higher-level protocol is in force is open to interpretation. Files local to a computer for which the native byte ordering is little-endian, for example, might be argued to be encoded as UTF-16LE implicitly. Therefore, the presumption of big-endian is widely ignored. The W3C/WHATWG encoding standard used in HTML5 specifies that content labelled either "utf-16" or "utf-16le" is to be interpreted as little-endian "to deal with deployed content".[10] However, if a byte-order mark is present, then that BOM is to be treated as "more authoritative than anything else".[11]

Programs that interpret UTF-16 as a byte-based encoding may display a garbled mess of characters, but ASCII characters would be recognizable because the low byte of the UTF-16 representation is the same as the ASCII code and therefore would be displayed the same. The upper byte of 0 may be displayed as nothing, white space, a period, or some other unvarying glyph.
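The effect can be reproduced by decoding UTF-16 bytes with a single-byte codec. In this sketch Latin-1 stands in for "a byte-based encoding" and the zero bytes are replaced with a period; both choices, and the function name, are assumptions for illustration.

```python
def view_as_single_byte(text: str, placeholder: str = ".") -> str:
    """Show how UTF-16LE bytes of text look when read one byte per character."""
    raw = text.encode("utf-16-le")
    # Latin-1 maps every byte to exactly one character, so nothing is lost.
    return raw.decode("latin-1").replace("\x00", placeholder)


print(view_as_single_byte("Hi!"))  # H.i.!.
```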

UTF-32

Although a BOM could be used with UTF-32, this encoding is rarely used for transmission. Otherwise the same rules as for UTF-16 are applicable.

The BOM for little-endian UTF-32 is the same pattern as a little-endian UTF-16 BOM followed by a NUL character, an unusual example of the BOM being the same pattern in two different encodings. Programmers using the BOM to identify the encoding will have to decide whether UTF-32 or a NUL first character is more likely.
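The ambiguity can be made explicit in code. The sketch below leaves the decision to the caller through a hypothetical `prefer_utf32` flag; the function name is likewise an assumption.

```python
import codecs


def classify_ff_fe(data: bytes, prefer_utf32: bool = True) -> str:
    """Classify data that may start with FF FE as UTF-32LE or UTF-16LE."""
    if data.startswith(codecs.BOM_UTF32_LE):  # FF FE 00 00
        # Could also be a UTF-16LE BOM followed by U+0000; the caller decides.
        return "utf-32-le" if prefer_utf32 else "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "utf-16-le"
    return "unknown"


utf32 = "\ufeffA".encode("utf-32-le")          # FF FE 00 00 41 00 00 00
utf16_nul = "\ufeff\x00A".encode("utf-16-le")  # FF FE 00 00 41 00
print(classify_ff_fe(utf32))
print(classify_ff_fe(utf16_nul))                      # same prefix, so same answer
print(classify_ff_fe(utf16_nul, prefer_utf32=False))
```

Checking the four-byte pattern before the two-byte one matters: testing for FF FE first would never report UTF-32.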

Byte order marks by encoding

This table illustrates how the BOM character is represented as a byte sequence in various encodings and how those sequences might appear in a text editor that is interpreting each byte as a legacy encoding (CP1252 and caret notation for the C0 controls):

Encoding      | Representation (hexadecimal) | Representation (decimal) | Bytes as CP1252 characters
UTF-8[a]      | EF BB BF                     | 239 187 191              | ï»¿
UTF-16 (BE)   | FE FF                        | 254 255                  | þÿ
UTF-16 (LE)   | FF FE                        | 255 254                  | ÿþ
UTF-32 (BE)   | 00 00 FE FF                  | 0 0 254 255              | ^@^@þÿ (^@ is the null character)
UTF-32 (LE)   | FF FE 00 00                  | 255 254 0 0              | ÿþ^@^@ (^@ is the null character)
UTF-7[a]      | 2B 2F 76 38                  | 43 47 118 56             | +/v8
              | 2B 2F 76 39                  | 43 47 118 57             | +/v9
              | 2B 2F 76 2B                  | 43 47 118 43             | +/v+
              | 2B 2F 76 2F[b]               | 43 47 118 47             | +/v/
              | 2B 2F 76 38 2D[c]            | 43 47 118 56 45          | +/v8-
UTF-1[a]      | F7 64 4C                     | 247 100 76               | ÷dL
UTF-EBCDIC[a] | DD 73 66 73                  | 221 115 102 115          | Ýsfs
SCSU[a]       | 0E FE FF[d]                  | 14 254 255               | ^Nþÿ (^N is the "shift out" character)
BOCU-1[a]     | FB EE 28                     | 251 238 40               | ûî(
GB-18030[a]   | 84 31 95 33                  | 132 49 149 51            | „1•3
  a. This is not literally a "byte order" mark, since the byte is also the code unit in these encodings and there is no byte order to resolve. The sequence can be used to indicate the encoding of the text which it is preceding, however.[5][12]
  b. In UTF-7, the fourth byte of the BOM, before encoding as base64, is 001111xx in binary. The final two bits, xx, are not specifically part of the BOM, but contain the first two bits of the first encoded character following the BOM. All four possible byte combinations are shown in the table, as well as a fifth which is used for an empty string.
  c. If no following character is encoded, 38 is used for the fourth byte and the following byte is 2D.
  d. SCSU allows other encodings of U+FEFF; the form shown is the signature recommended in UTR #6.[13]
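The table lends itself to a lookup-based detector. The following is a minimal sketch built only from the byte sequences listed above; the names `SIGNATURES` and `sniff_bom` are hypothetical, and the result is a hint rather than proof of the encoding.

```python
# (signature bytes, encoding name), longest signatures first so that
# UTF-32 (LE) is tested before its two-byte prefix UTF-16 (LE).
SIGNATURES = [
    (b"\x00\x00\xfe\xff", "UTF-32 (BE)"),
    (b"\xff\xfe\x00\x00", "UTF-32 (LE)"),
    (b"\xdd\x73\x66\x73", "UTF-EBCDIC"),
    (b"\x84\x31\x95\x33", "GB-18030"),
    (b"\x2b\x2f\x76\x38", "UTF-7"),
    (b"\x2b\x2f\x76\x39", "UTF-7"),
    (b"\x2b\x2f\x76\x2b", "UTF-7"),
    (b"\x2b\x2f\x76\x2f", "UTF-7"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xf7\x64\x4c", "UTF-1"),
    (b"\x0e\xfe\xff", "SCSU"),
    (b"\xfb\xee\x28", "BOCU-1"),
    (b"\xfe\xff", "UTF-16 (BE)"),
    (b"\xff\xfe", "UTF-16 (LE)"),
]


def sniff_bom(data: bytes):
    """Return the encoding name whose signature starts data, or None."""
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            return name
    return None


print(sniff_bom(b"\xef\xbb\xbfhello"))  # UTF-8
print(sniff_bom(b"\xff\xfeh\x00"))      # UTF-16 (LE)
print(sniff_bom(b"hello"))              # None
```

The function returns only a name, not a length to strip, because in UTF-7 the last signature byte also carries bits of the following character.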

References

  1. "FAQ - UTF-8, UTF-16, UTF-32 & BOM". Unicode.org. Retrieved 2017-01-28.
  2. "The Unicode® Standard Version 9.0" (PDF). The Unicode Consortium.
  3. "The Unicode Standard 5.0, Chapter 2: General Structure" (PDF). p. 36. Retrieved 2009-03-29. Table 2-4. The Seven Unicode Encoding Schemes
  4. "The Unicode Standard 5.0, Chapter 2: General Structure" (PDF). p. 36. Retrieved 2008-11-30. "Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature"
  5. "FAQ - UTF-8, UTF-16, UTF-32 & BOM: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If yes, then can I still assume the remaining UTF-8 bytes are in big-endian order?". Unicode.org. Retrieved 2009-01-04.
  6. "Re: pre-HTML5 and the BOM from Asmus Freytag on 2012-07-13 (Unicode Mail List Archive)". Unicode.org. Retrieved 2012-07-14.
  7. "Bug ID: JDK-6378911 UTF-8 decoder handling of byte-order mark has changed". Bugs.. Retrieved 2017-01-28.
  8. Yergeau, Francois (November 2003). UTF-8, a transformation format of ISO 10646. IETF. doi:10.17487/RFC3629. RFC 3629. Retrieved May 15, 2014.
  9. Alf P. Steinbach (2011). "Unicode part 1: Windows console i/o approaches". Retrieved 24 March 2012. "However, since the C++ source code was encoded as UTF-8 without BOM (as is usual in Linux), the Visual C++ compiler erroneously assumed that the source code was encoded as Windows ANSI."
  10. "UTF-16LE". Encoding Standard. WHATWG.
  11. "Decode". Encoding Standard. WHATWG.
  12. "RFC 3629 - UTF-8, a transformation format of ISO 10646". Tools.. 2003-11-08. Retrieved 2017-01-28.
  13. Markus Scherer. "UTS #6: Compression Scheme for Unicode". Unicode.org. Retrieved 2017-01-28.
