author: Miss Islington (bot) <31488909+miss-islington@users.noreply.github.com> 2022-11-22 10:16:59 -0800
committer: GitHub <noreply@github.com> 2022-11-22 18:16:59 +0000
commit: 420b7e8f43fe0527a8e68fca57227e35733d0b2b
tree: ef0dc609ef5394c204f7678fbe70bdbd86a5dc1d
parent: [3.11] gh-99645: Fix a bug in handling class cleanups in unittest.TestCase (G...
[3.11] gh-99146 struct module documentation should have more predictable examples/warnings (GH-99141) (GH-99702)
gh-99146 struct module documentation should have more predictable examples/warnings (GH-99141)

* nail down a couple examples to have more predictable output
* update a number of things, but this is really just a stash...
* added an applications section to describe typical uses for native and machine-independent formats
* make sure all format strings use a format prefix character
* responding to comments from @gpshead. Not likely finished yet.
* This got more involved than I expected...
* respond to several PR comments
* a lot of wordsmithing
* try and be more consistent in use of ``x`` vs ``'x'``
* expand examples a bit
* update the "see also" to be more up-to-date
* original examples relied on import * so present all examples as if
* reformat based on @gpshead comment (missed before)
* responding to comments
* missed this
* one more suggested edit
* wordsmithing

(cherry picked from commit 22d91c16bb03c3d87f53b5fee10325b876262a78)

Co-authored-by: Skip Montanaro <skip.montanaro@gmail.com>
-rw-r--r-- Doc/library/struct.rst 282
1 files changed, 206 insertions, 76 deletions
diff --git a/Doc/library/struct.rst b/Doc/library/struct.rst
index 620f50376be..50d70731f77 100644
--- a/Doc/library/struct.rst
+++ b/Doc/library/struct.rst
@@ -12,21 +12,25 @@
--------------
-This module performs conversions between Python values and C structs represented
-as Python :class:`bytes` objects. This can be used in handling binary data
-stored in files or from network connections, among other sources. It uses
-:ref:`struct-format-strings` as compact descriptions of the layout of the C
-structs and the intended conversion to/from Python values.
+This module converts between Python values and C structs represented
+as Python :class:`bytes` objects. Compact :ref:`format strings <struct-format-strings>`
+describe the intended conversions to/from Python values.
+The module's functions and objects can be used for two largely
+distinct applications: data exchange with external sources (files or
+network connections), and data transfer between the Python application
+and the C layer.
.. note::
- By default, the result of packing a given C struct includes pad bytes in
- order to maintain proper alignment for the C types involved; similarly,
- alignment is taken into account when unpacking. This behavior is chosen so
- that the bytes of a packed struct correspond exactly to the layout in memory
- of the corresponding C struct. To handle platform-independent data formats
- or omit implicit pad bytes, use ``standard`` size and alignment instead of
- ``native`` size and alignment: see :ref:`struct-alignment` for details.
+ When no prefix character is given, native mode is the default. It
+ packs or unpacks data based on the platform and compiler on which
+ the Python interpreter was built.
+ The result of packing a given C struct includes pad bytes which
+ maintain proper alignment for the C types involved; similarly,
+ alignment is taken into account when unpacking. In contrast, when
+ communicating data between external sources, the programmer is
+ responsible for defining byte ordering and padding between elements.
+ See :ref:`struct-alignment` for details.
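As a rough illustration of the note above, the following sketch compares the same format characters under the native (``'@'``) and little-endian standard (``'<'``) prefixes. The native size shown in the comment is only what is commonly observed; it depends on the platform and compiler.

```python
import struct

# Same format characters, different prefix characters.
native_size = struct.calcsize('@ci')    # platform dependent (commonly 8)
standard_size = struct.calcsize('<ci')  # 1-byte char + 4-byte int, no padding

# Standard mode never inserts pad bytes on its own.
packed = struct.pack('<ci', b'#', 0x12131415)

print(native_size, standard_size, packed)
```

In standard mode the result has the same length on every platform, which is why it suits data exchanged with external sources.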
Several :mod:`struct` functions (and methods of :class:`Struct`) take a *buffer*
argument. This refers to objects that implement the :ref:`bufferobjects` and
@@ -102,10 +106,13 @@ The module defines the following exception and functions:
Format Strings
--------------
-Format strings are the mechanism used to specify the expected layout when
-packing and unpacking data. They are built up from :ref:`format-characters`,
-which specify the type of data being packed/unpacked. In addition, there are
-special characters for controlling the :ref:`struct-alignment`.
+Format strings describe the data layout when
+packing and unpacking data. They are built up from :ref:`format characters<format-characters>`,
+which specify the type of data being packed/unpacked. In addition,
+special characters control the :ref:`byte order, size and alignment<struct-alignment>`.
+Each format string consists of an optional prefix character which
+describes the overall properties of the data and one or more format
+characters which describe the actual data values and padding.
.. _struct-alignment:
@@ -116,6 +123,11 @@ Byte Order, Size, and Alignment
By default, C types are represented in the machine's native format and byte
order, and properly aligned by skipping pad bytes if necessary (according to the
rules used by the C compiler).
+This behavior is chosen so
+that the bytes of a packed struct correspond exactly to the memory layout
+of the corresponding C struct.
+Whether to use native byte ordering
+and padding or standard formats depends on the application.
.. index::
single: @ (at); in struct format strings
@@ -144,12 +156,10 @@ following table:
If the first character is not one of these, ``'@'`` is assumed.
-Native byte order is big-endian or little-endian, depending on the host
-system. For example, Intel x86 and AMD64 (x86-64) are little-endian;
-IBM z and most legacy architectures are big-endian;
-and ARM, RISC-V and IBM Power feature switchable endianness
-(bi-endian, though the former two are nearly always little-endian in practice).
-Use ``sys.byteorder`` to check the endianness of your system.
+Native byte order is big-endian or little-endian, depending on the
+host system. For example, Intel x86, AMD64 (x86-64), and Apple M1 are
+little-endian; IBM z and many legacy architectures are big-endian.
+Use :data:`sys.byteorder` to check the endianness of your system.
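A minimal sketch of how the prefix character selects byte order, and how ``sys.byteorder`` relates to it. The value ``0x01020304`` is an arbitrary choice for illustration.

```python
import struct
import sys

value = 0x01020304  # arbitrary illustrative value

little = struct.pack('<I', value)  # least significant byte first
big = struct.pack('>I', value)     # most significant byte first
native = struct.pack('=I', value)  # native byte order, standard size

# The native result matches whichever order sys.byteorder reports.
expected = little if sys.byteorder == 'little' else big
print(sys.byteorder, native == expected)
```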
Native size and alignment are determined using the C compiler's
``sizeof`` expression. This is always combined with native byte order.
@@ -231,9 +241,9 @@ platform-dependent.
+--------+--------------------------+--------------------+----------------+------------+
| ``d`` | :c:expr:`double` | float | 8 | \(4) |
+--------+--------------------------+--------------------+----------------+------------+
-| ``s`` | :c:expr:`char[]` | bytes | | |
+| ``s`` | :c:expr:`char[]` | bytes | | \(9) |
+--------+--------------------------+--------------------+----------------+------------+
-| ``p`` | :c:expr:`char[]` | bytes | | |
+| ``p`` | :c:expr:`char[]` | bytes | | \(8) |
+--------+--------------------------+--------------------+----------------+------------+
| ``P`` | :c:expr:`void \*` | integer | | \(5) |
+--------+--------------------------+--------------------+----------------+------------+
@@ -292,8 +302,33 @@ Notes:
format <half precision format_>`_ for more information.
(7)
- For padding, ``x`` inserts null bytes.
-
+ When packing, ``'x'`` inserts one NUL byte.
+
+(8)
+ The ``'p'`` format character encodes a "Pascal string", meaning a short
+ variable-length string stored in a *fixed number of bytes*, given by the count.
+ The first byte stored is the length of the string, or 255, whichever is
+ smaller. The bytes of the string follow. If the string passed in to
+ :func:`pack` is too long (longer than the count minus 1), only the leading
+ ``count-1`` bytes of the string are stored. If the string is shorter than
+ ``count-1``, it is padded with null bytes so that exactly count bytes in all
+ are used. Note that for :func:`unpack`, the ``'p'`` format character consumes
+ ``count`` bytes, but that the string returned can never contain more than 255
+ bytes.
+
+(9)
+ For the ``'s'`` format character, the count is interpreted as the length of the
+ bytes, not a repeat count like for the other format characters; for example,
+ ``'10s'`` means a single 10-byte string mapping to or from a single
+ Python byte string, while ``'10c'`` means 10
+ separate one byte character elements (e.g., ``cccccccccc``) mapping
+ to or from ten different Python byte objects. (See :ref:`struct-examples`
+ for a concrete demonstration of the difference.)
+ If a count is not given, it defaults to 1. For packing, the string is
+ truncated or padded with null bytes as appropriate to make it fit. For
+ unpacking, the resulting bytes object always has exactly the specified number
+ of bytes. As a special case, ``'0s'`` means a single, empty string (while
+ ``'0c'`` means 0 characters).
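Notes (8) and (9) can be sketched as follows. The byte strings are arbitrary, and the ``'<'`` prefix is used only so that the output does not depend on the platform.

```python
import struct

# 's': the count is the field length; short input is padded with NUL bytes.
s_packed = struct.pack('<10s', b'abc')

# 'c': the count is a repeat count; each element is a separate 1-byte object.
c_fields = struct.unpack('<3c', b'abc')

# 'p': a length byte followed by at most count-1 data bytes.
p_short = struct.pack('<5p', b'ab')       # padded to 5 bytes in total
p_long = struct.pack('<5p', b'abcdefg')   # truncated to 4 data bytes
p_back = struct.unpack('<5p', p_long)

print(s_packed, c_fields, p_short, p_long, p_back)
```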
A format character may be preceded by an integral repeat count. For example,
the format string ``'4h'`` means exactly the same as ``'hhhh'``.
@@ -301,15 +336,6 @@ the format string ``'4h'`` means exactly the same as ``'hhhh'``.
Whitespace characters between formats are ignored; a count and its format must
not contain whitespace though.
-For the ``'s'`` format character, the count is interpreted as the length of the
-bytes, not a repeat count like for the other format characters; for example,
-``'10s'`` means a single 10-byte string, while ``'10c'`` means 10 characters.
-If a count is not given, it defaults to 1. For packing, the string is
-truncated or padded with null bytes as appropriate to make it fit. For
-unpacking, the resulting bytes object always has exactly the specified number
-of bytes. As a special case, ``'0s'`` means a single, empty string (while
-``'0c'`` means 0 characters).
-
When packing a value ``x`` using one of the integer formats (``'b'``,
``'B'``, ``'h'``, ``'H'``, ``'i'``, ``'I'``, ``'l'``, ``'L'``,
``'q'``, ``'Q'``), if ``x`` is outside the valid range for that format
@@ -319,17 +345,6 @@ then :exc:`struct.error` is raised.
Previously, some of the integer formats wrapped out-of-range values and
raised :exc:`DeprecationWarning` instead of :exc:`struct.error`.
-The ``'p'`` format character encodes a "Pascal string", meaning a short
-variable-length string stored in a *fixed number of bytes*, given by the count.
-The first byte stored is the length of the string, or 255, whichever is
-smaller. The bytes of the string follow. If the string passed in to
-:func:`pack` is too long (longer than the count minus 1), only the leading
-``count-1`` bytes of the string are stored. If the string is shorter than
-``count-1``, it is padded with null bytes so that exactly count bytes in all
-are used. Note that for :func:`unpack`, the ``'p'`` format character consumes
-``count`` bytes, but that the string returned can never contain more than 255
-bytes.
-
.. index:: single: ? (question mark); in struct format strings
For the ``'?'`` format character, the return value is either :const:`True` or
@@ -345,18 +360,36 @@ Examples
^^^^^^^^
.. note::
- All examples assume a native byte order, size, and alignment with a
- big-endian machine.
+ Native byte order examples (designated by the ``'@'`` format prefix or
+ lack of any prefix character) may not match what the reader's
+ machine produces as
+ that depends on the platform and compiler.
+
+Pack and unpack integers of three different sizes, using big-endian
+ordering::
-A basic example of packing/unpacking three integers::
+ >>> from struct import *
+ >>> pack(">bhl", 1, 2, 3)
+ b'\x01\x00\x02\x00\x00\x00\x03'
+ >>> unpack('>bhl', b'\x01\x00\x02\x00\x00\x00\x03')
+ (1, 2, 3)
+ >>> calcsize('>bhl')
+ 7
- >>> from struct import *
- >>> pack('hhl', 1, 2, 3)
- b'\x00\x01\x00\x02\x00\x00\x00\x03'
- >>> unpack('hhl', b'\x00\x01\x00\x02\x00\x00\x00\x03')
- (1, 2, 3)
- >>> calcsize('hhl')
- 8
+Attempt to pack an integer which is too large for the defined field::
+
+ >>> pack(">h", 99999)
+ Traceback (most recent call last):
+ File "<stdin>", line 1, in <module>
+ struct.error: 'h' format requires -32768 <= number <= 32767
+
+Demonstrate the difference between ``'s'`` and ``'c'`` format
+characters::
+
+ >>> pack("@ccc", b'1', b'2', b'3')
+ b'123'
+ >>> pack("@3s", b'123')
+ b'123'
Unpacked fields can be named by assigning them to variables or by wrapping
the result in a named tuple::
@@ -369,35 +402,132 @@ the result in a named tuple::
>>> Student._make(unpack('<10sHHb', record))
Student(name=b'raymond ', serialnum=4658, school=264, gradelevel=8)
-The ordering of format characters may have an impact on size since the padding
-needed to satisfy alignment requirements is different::
-
- >>> pack('ci', b'*', 0x12131415)
- b'*\x00\x00\x00\x12\x13\x14\x15'
- >>> pack('ic', 0x12131415, b'*')
- b'\x12\x13\x14\x15*'
- >>> calcsize('ci')
+The ordering of format characters may have an impact on size in native
+mode since padding is implicit. In standard mode, the user is
+responsible for inserting any desired padding.
+Note in
+the first ``pack`` call below that three NUL bytes were added after the
+packed ``'#'`` to align the following integer on a four-byte boundary.
+In this example, the output was produced on a little-endian machine::
+
+ >>> pack('@ci', b'#', 0x12131415)
+ b'#\x00\x00\x00\x15\x14\x13\x12'
+ >>> pack('@ic', 0x12131415, b'#')
+ b'\x15\x14\x13\x12#'
+ >>> calcsize('@ci')
8
- >>> calcsize('ic')
+ >>> calcsize('@ic')
5
-The following format ``'llh0l'`` specifies two pad bytes at the end, assuming
-longs are aligned on 4-byte boundaries::
+The following format ``'llh0l'`` results in two pad bytes being added
+at the end, assuming the platform's longs are aligned on 4-byte boundaries::
- >>> pack('llh0l', 1, 2, 3)
+ >>> pack('@llh0l', 1, 2, 3)
b'\x00\x00\x00\x01\x00\x00\x00\x02\x00\x03\x00\x00'
-This only works when native size and alignment are in effect; standard size and
-alignment does not enforce any alignment.
-
.. seealso::
Module :mod:`array`
Packed binary storage of homogeneous data.
- Module :mod:`xdrlib`
- Packing and unpacking of XDR data.
+ Module :mod:`json`
+ JSON encoder and decoder.
+
+ Module :mod:`pickle`
+ Python object serialization.
+
+
+.. _applications:
+
+Applications
+------------
+
+Two main applications for the :mod:`struct` module exist: data
+interchange between Python and C code within an application or another
+application compiled using the same compiler (:ref:`native formats<struct-native-formats>`), and
+data interchange between applications using an agreed-upon data layout
+(:ref:`standard formats<struct-standard-formats>`). Generally speaking, the format strings
+constructed for these two domains are distinct.
+
+
+.. _struct-native-formats:
+
+Native Formats
+^^^^^^^^^^^^^^
+
+When constructing format strings which mimic native layouts, the
+compiler and machine architecture determine byte ordering and padding.
+In such cases, the ``@`` format character should be used to specify
+native byte ordering and data sizes. Internal pad bytes are normally inserted
+automatically. It is possible that a zero-repeat format code will be
+needed at the end of a format string to round up to the correct
+byte boundary for proper alignment of consecutive chunks of data.
+
+Consider these two simple examples (on a 64-bit, little-endian
+machine)::
+
+ >>> calcsize('@lhl')
+ 24
+ >>> calcsize('@llh')
+ 18
+
+Data is not padded to an 8-byte boundary at the end of the second
+format string without the use of extra padding. A zero-repeat format
+code solves that problem::
+
+ >>> calcsize('@llh0l')
+ 24
+
+The ``'x'`` format code can be used to insert the pad bytes explicitly, but for
+native formats it is better to use a zero-repeat format like ``'0l'``.
+
+By default, native byte ordering and alignment are used, but it is
+better to be explicit and use the ``'@'`` prefix character.
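The effect of a trailing zero-repeat code can be checked without assuming a particular word size. This sketch only compares sizes reported by the running interpreter; the actual numbers vary by platform.

```python
import struct

long_size = struct.calcsize('@l')
unpadded = struct.calcsize('@llh')
padded = struct.calcsize('@llh0l')

# '0l' consumes no argument; it only rounds the size up to the
# alignment required for a long on this platform.
data = struct.pack('@llh0l', 1, 2, 3)

print(long_size, unpadded, padded, len(data))
```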
+
+
+.. _struct-standard-formats:
+
+Standard Formats
+^^^^^^^^^^^^^^^^
+
+When exchanging data beyond your process, such as over a network or via storage,
+be precise. Specify the exact byte order, size, and alignment. Do
+not assume they match the native order of a particular machine.
+For example, network byte order is big-endian, while many popular CPUs
+are little-endian. By defining this explicitly, the user need not
+care about the specifics of the platform their code is running on.
+The first character should typically be ``<`` or ``>``
+(or ``!``). Padding is the responsibility of the programmer. A
+zero-repeat format code has no effect here, because standard mode does not
+align fields. Instead, the user must
+explicitly add ``'x'`` pad bytes where needed. Revisiting the
+examples from the previous section, we have::
+
+ >>> calcsize('<qh6xq')
+ 24
+ >>> pack('<qh6xq', 1, 2, 3) == pack('@lhl', 1, 2, 3)
+ True
+ >>> calcsize('@llh')
+ 18
+ >>> pack('@llh', 1, 2, 3) == pack('<qqh', 1, 2, 3)
+ True
+ >>> calcsize('<qqh6x')
+ 24
+ >>> calcsize('@llh0l')
+ 24
+ >>> pack('@llh0l', 1, 2, 3) == pack('<qqh6x', 1, 2, 3)
+ True
+
+The above results (executed on a 64-bit machine) aren't guaranteed to
+match when executed on different machines. For example, the examples
+below were executed on a 32-bit machine::
+
+ >>> calcsize('<qqh6x')
+ 24
+ >>> calcsize('@llh0l')
+ 12
+ >>> pack('@llh0l', 1, 2, 3) == pack('<qqh6x', 1, 2, 3)
+ False
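The platform independence of standard formats can be sketched as follows. Unlike the native results above, these sizes should be the same on any machine, because standard sizes are fixed and no implicit padding is added.

```python
import struct

# 8 + 2 + 6 pad bytes + 8 = 24, regardless of platform.
size_explicit = struct.calcsize('<qh6xq')

# In standard mode '0l' contributes nothing: 4 + 4 + 2 = 10.
size_zero_repeat = struct.calcsize('<llh0l')

# Pad bytes written by 'x' are skipped again when unpacking.
packed = struct.pack('>qh6xq', 1, 2, 3)
values = struct.unpack('>qh6xq', packed)

print(size_explicit, size_zero_repeat, values)
```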
.. _struct-objects:
@@ -411,9 +541,9 @@ The :mod:`struct` module also defines the following type:
.. class:: Struct(format)
Return a new Struct object which writes and reads binary data according to
- the format string *format*. Creating a Struct object once and calling its
- methods is more efficient than calling the :mod:`struct` functions with the
- same format since the format string only needs to be compiled once.
+ the format string *format*. Creating a ``Struct`` object once and calling its
+ methods is more efficient than calling module-level functions with the
+ same format since the format string is only compiled once.
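A minimal sketch of reusing one compiled ``Struct`` object. The record layout (an unsigned 16-bit id followed by a 32-bit float) is hypothetical and chosen only for illustration.

```python
import struct

# Hypothetical record layout: unsigned short id, then a 32-bit float.
record = struct.Struct('<Hf')

# The format string is compiled once and reused for every call.
packed = [record.pack(i, i * 0.5) for i in range(3)]
values = [record.unpack(p) for p in packed]

print(record.size, values)
```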
.. note::