The magical Python pickle module

Background

Persistence means keeping objects around, even between multiple executions of the same program. This article gives you a general understanding of the various persistence mechanisms available for Python objects, from relational databases to Python's own pickle and beyond. It also gives you a deeper look at Python's object serialization capabilities.

What is persistence?

The basic idea of persistence is simple. Suppose there is a Python program. It may be a program that manages daily to-do items. You want to save application objects (to-do items) between multiple executions of this program. In other words, you want to store the object on disk so that it can be retrieved later. This is persistence. To achieve this goal, there are several methods, each of which has its advantages and disadvantages.

For example, the object data can be stored in a text file in a certain format, such as a CSV file. Or you can use a relational database, such as Gadfly, MySQL, PostgreSQL, or DB2. These file formats and databases are excellent, and Python has a robust interface for all these storage mechanisms.

These storage mechanisms have one thing in common: the stored data is independent of the objects and programs that operate on it. The advantage is that the data can serve as a shared resource for other applications. The disadvantage is that other programs can then access an object's data directly, which violates the object-oriented principle of encapsulation, namely that an object's data should be accessible only through the object's own public interface.

In addition, for some applications the relational database approach may not be ideal. In particular, relational databases do not understand objects. Instead, a relational database imposes its own type system and relational data model (tables). Each table contains a set of tuples (rows), and each row contains a fixed number of statically typed fields (columns). If the application's object model does not translate easily into the relational model, it will be difficult to map objects to tuples and tuples back to objects. This difficulty is often referred to as the impedance-mismatch problem.

Object persistence

If you want to store Python objects transparently, without losing information such as their identity and type, you need some form of object serialization: the process of turning arbitrarily complex objects into textual or binary representations. It must likewise be possible to restore the serialized form to the original object. In Python, this serialization process is called pickling. You can pickle objects to strings, files on disk, or any file-like object, and unpickle those strings, files, or file-like objects back into the original objects. We will discuss pickle in detail later in this article.

Suppose you like to keep everything as an object and want to avoid the overhead of converting objects into some kind of non-object storage. Pickle files provide these benefits, but sometimes you need something more robust and scalable than simple pickle files. For example, pickling alone does not solve the problem of naming and locating pickle files, nor does it support concurrent access to persistent objects. If you need these features, you should turn to a database such as ZODB (the Z Object Database for Python). ZODB is a robust, multi-user, object-oriented database system that can store and manage arbitrarily complex Python objects, with support for transactions and concurrency control. (See Resources to download ZODB.) Interestingly enough, even ZODB relies on Python's native serialization capabilities, and to use ZODB effectively you must have a solid understanding of pickle.

Another interesting solution to the persistence problem is Prevayler, which was originally implemented in Java (for the developerWorks article on Prevayler, see Resources). A group of Python programmers ported Prevayler to Python and named it PyPerSyst, hosted on SourceForge (for a link to the PyPerSyst project, see Resources). The Prevayler/PyPerSyst concept also builds on the native serialization capabilities of the Java and Python languages. PyPerSyst keeps the entire object system in memory and provides disaster recovery by pickling snapshots of the system to disk from time to time and by maintaining a log of commands that can be reapplied to the latest snapshot. So, although applications using PyPerSyst are limited by available memory, the benefits are that the native object system is loaded completely into memory, which makes it extremely fast, and that it is simpler to implement than a database such as ZODB, which allows for more objects than can be held in memory at one time.

Now that we have briefly discussed the various ways of storing persistent objects, it is time to look at the pickling process in detail. Although we are mainly interested in ways of saving Python objects without having to convert them into some other format, there are still several issues to pay attention to: how to pickle and unpickle both simple and complex objects effectively, including instances of custom classes; how to maintain object references, including circular and recursive references; and how to handle changes to class definitions so that previously pickled instances continue to work. We will cover all of these issues in the discussion of Python's pickling capabilities that follows.

Some pickled Python

The pickle module and its close relative, the cPickle module, provide pickling support for Python. The latter is coded in C and performs better, so it is the recommended choice for most applications. We will continue to talk about pickle, but the examples in this article actually use cPickle. Since most of the examples are shown in the Python shell, let's start with how to import cPickle so that it can be referred to as pickle:

    >>> import cPickle as pickle

Now that the module has been imported, let's take a look at the pickle interface. The pickle module provides the following pairs of functions: dumps(object) returns a string containing the object in pickle format; loads(string) returns the object contained in the pickle string; dump(object, file) writes the object to a file, which can be an actual physical file or any file-like object that has a write() method accepting a single string argument; load(file) returns the object contained in the pickle file.

By default, dumps() and dump() create pickles using a printable ASCII representation. Both take an optional final argument; if it is True, the pickle is created using a faster and smaller binary representation. The loads() and load() functions automatically detect whether a pickle is in binary or text format.
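As a rough illustration of the interface just described, the sketch below writes a pickle to a minimal file-like object and then reads a text pickle and a binary pickle back with the same loads() call. It assumes Python 3, where the pickle module uses its C implementation automatically and the optional final argument is spelled protocol (0 for the text format, 2 for a binary format). ChunkCollector is a hypothetical helper written for this example, not part of pickle.

```python
import pickle


class ChunkCollector:
    """Hypothetical file-like object: all dump() needs is a write() method."""

    def __init__(self):
        self.chunks = []

    def write(self, data):
        # Python 3 hands write() bytes-like data rather than a str.
        self.chunks.append(bytes(data))


record = ('this is a string', 42, [1, 2, 3], None)

# dump() to the file-like object using the text protocol.
sink = ChunkCollector()
pickle.dump(record, sink, protocol=0)
text_pickle = b"".join(sink.chunks)

# dumps() using a binary protocol.
binary_pickle = pickle.dumps(record, protocol=2)

# loads() detects the format on its own; no protocol argument is needed.
from_text = pickle.loads(text_pickle)
from_binary = pickle.loads(binary_pickle)
print(from_text == record, from_binary == record)
```

Any object with a suitable write() method would do here; a socket wrapper or an in-memory buffer are common choices.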

Listing 1 shows an interactive session, using the dumps() and loads() functions just described:

Listing 1. Demonstration of dumps() and loads()

    >>> import cPickle as pickle
    >>> t1 = ('this is a string', 42, [1, 2, 3], None)
    >>> t1
    ('this is a string', 42, [1, 2, 3], None)
    >>> p1 = pickle.dumps(t1)
    >>> p1
    "(S'this is a string'\nI42\n(lp1\nI1\naI2\naI3\naNtp2\n."
    >>> print p1
    (S'this is a string'
    I42
    (lp1
    I1
    aI2
    aI3
    aNtp2
    .
    >>> t2 = pickle.loads(p1)
    >>> t2
    ('this is a string', 42, [1, 2, 3], None)
    >>> p2 = pickle.dumps(t1, True)
    >>> p2
    '(U\x10this is a stringK*]q\x01(K\x01K\x02K\x03eNtq\x02.'
    >>> t3 = pickle.loads(p2)
    >>> t3
    ('this is a string', 42, [1, 2, 3], None)
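To get a feel for how the two formats compare on something larger than a single tuple, a sketch like the following can be used. It assumes Python 3 and its protocol keyword; the records list is made-up sample data, and the exact byte counts will vary with the data and the Python version.

```python
import pickle

# Made-up sample data: a thousand small dictionaries.
records = [
    {"id": i, "name": "item %d" % i, "tags": ["a", "b", "c"]}
    for i in range(1000)
]

text_size = len(pickle.dumps(records, protocol=0))    # printable text format
binary_size = len(pickle.dumps(records, protocol=2))  # binary format

print("text pickle:  ", text_size, "bytes")
print("binary pickle:", binary_size, "bytes")
```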

Note: The text pickle format is quite simple and will not be explained here. In fact, all of the conventions it uses are documented in the pickle module itself. We should also point out that our examples use only simple objects, so the binary pickle format does not show much of a saving in space here. In a real system with complex objects, however, the binary format can bring significant improvements in both size and speed.

Next, let's look at some examples that use dump() and load(), which work with files and file-like objects. These functions operate much like the dumps() and loads() we have just seen, with one additional capability: dump() can write several objects to the same file, one after another. Subsequent calls to load() then retrieve these objects in the same order. Listing 2 shows this capability in action:

Listing 2. dump() and load() examples

    >>> a1 = 'apple'
    >>> b1 = {1: 'One', 2: 'Two', 3: 'Three'}
    >>> c1 = ['fee', 'fie', 'foe', 'fum']
    >>> f1 = file('temp.pkl', 'wb')
    >>> pickle.dump(a1, f1, True)
    >>> pickle.dump(b1, f1, True)
    >>> pickle.dump(c1, f1, True)
    >>> f1.close()
    >>> f2 = file('temp.pkl', 'rb')
    >>> a2 = pickle.load(f2)
    >>> a2
    'apple'
    >>> b2 = pickle.load(f2)
    >>> b2
    {1: 'One', 2: 'Two', 3: 'Three'}
    >>> c2 = pickle.load(f2)
    >>> c2
    ['fee', 'fie', 'foe', 'fum']
    >>> f2.close()
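When the number of objects in a pickle file is not known in advance, one common pattern is to keep calling load() until it raises EOFError. The sketch below assumes Python 3 and uses an in-memory io.BytesIO buffer in place of temp.pkl; load_all is a hypothetical helper name, not a function provided by pickle.

```python
import io
import pickle


def load_all(fileobj):
    """Yield every pickled object in fileobj, in the order it was dumped."""
    while True:
        try:
            yield pickle.load(fileobj)
        except EOFError:
            # load() raises EOFError once the stream is exhausted.
            return


items = ['apple', {1: 'One', 2: 'Two', 3: 'Three'}, ['fee', 'fie', 'foe', 'fum']]

buffer = io.BytesIO()           # stands in for a file opened in 'wb' mode
for item in items:
    pickle.dump(item, buffer)   # several objects, one after another

buffer.seek(0)                  # rewind, as if reopening the file for reading
restored = list(load_all(buffer))
print(restored)
```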

The power of Pickle

So far, we have covered the basics of pickling. In this section, we will discuss some of the more advanced issues you will run into once you start pickling complex objects, including instances of custom classes. Fortunately, Python handles these situations with ease.

Portability

Pickles are portable across both space and time. In other words, the pickle file format is independent of machine architecture, which means, for example, that you can create a pickle under Linux and send it to a Python program running under Windows or Mac OS. And when you upgrade to a newer version of Python, you do not have to worry about losing your existing pickles: the Python developers have guaranteed that the pickle format will remain backward compatible across Python versions. In fact, details about the current and supported formats are provided by the pickle module itself:

Listing 3. Retrieving the supported formats

    >>> pickle.format_version
    '1.3'
    >>> pickle.compatible_formats
    ['1.0', '1.1', '1.2']
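In practice, portability between interpreter versions usually comes down to choosing a protocol that every reader understands. The sketch below assumes Python 3, where the protocol is selected with the protocol keyword and the newest one is exposed as pickle.HIGHEST_PROTOCOL. The settings dictionary and the SHARED_PROTOCOL constant are illustrative names, and the choice of protocol 2 (readable from Python 2.3 onward) is an assumption about which readers need to be supported.

```python
import pickle

# The attributes from Listing 3 still exist; the values differ by version.
print("format version:    ", pickle.format_version)
print("compatible formats:", pickle.compatible_formats)
print("highest protocol:  ", pickle.HIGHEST_PROTOCOL)

settings = {"name": "to-do list", "items": ["write", "test", "ship"]}

# Assumption: some readers run older interpreters, so pin an older protocol
# instead of relying on the writer's default.
SHARED_PROTOCOL = 2
payload = pickle.dumps(settings, protocol=SHARED_PROTOCOL)

# Newer interpreters read older protocols without any extra arguments.
restored = pickle.loads(payload)
print(restored == settings)
```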

Multiple references, same object

In Python, variables are references to objects, and several variables can refer to the same object. As it turns out, Python has no difficulty maintaining this behavior with pickled objects, as shown in Listing 4:

Listing 4. Maintaining object references

    >>> a = [1, 2, 3]
    >>> b = a
    >>> a
    [1, 2, 3]
    >>> b
    [1, 2, 3]
    >>> a.append(4)
    >>> a
    [1, 2, 3, 4]
    >>> b
    [1, 2, 3, 4]
    >>> c = pickle.dumps((a, b))
    >>> d, e = pickle.loads(c)
    >>> d
    [1, 2, 3, 4]
    >>> e
    [1, 2, 3, 4]
    >>> d.append(5)
    >>> d
    [1, 2, 3, 4, 5]
    >>> e
    [1, 2, 3, 4, 5]

Circular references and recursive references

The object reference support just demonstrated extends to circular references (two objects that each contain a reference to the other) and recursive references (an object that contains a reference to itself). The following two listings highlight this capability. Let's look at recursive references first:

Listing 5. Recursive references

    >>> l = [1, 2, 3]
    >>> l.append(l)
    >>> l
    [1, 2, 3, [...]]
    >>> l[3]
    [1, 2, 3, [...]]
    >>> l[3][3]
    [1, 2, 3, [...]]
    >>> p = pickle.dumps(l)
    >>> l2 = pickle.loads(p)
    >>> l2
    [1, 2, 3, [...]]
    >>> l2[3]
    [1, 2, 3, [...]]
    >>> l2[3][3]
    [1, 2, 3, [...]]

Now, look at an example of circular references:

Listing 6. Circular reference

    >>> a = [1, 2]
    >>> b = [3, 4]
    >>> a.append(b)
    >>> a
    [1, 2, [3, 4]]
    >>> b.append(a)
    >>> a
    [1, 2, [3, 4, [...]]]
    >>> b
    [3, 4, [1, 2, [...]]]
    >>> a[2]
    [3, 4, [1, 2, [...]]]
    >>> b[2]
    [1, 2, [3, 4, [...]]]
    >>> a[2] is b
    1
    >>> b[2] is a
    1
    >>> f = file('temp.pkl', 'w')
    >>> pickle.dump((a, b), f)
    >>> f.close()
    >>> f = file('temp.pkl', 'r')
    >>> c, d = pickle.load(f)
    >>> f.close()
    >>> c
    [1, 2, [3, 4, [...]]]
    >>> d
    [3, 4, [1, 2, [...]]]
    >>> c[2]
    [3, 4, [1, 2, [...]]]
    >>> d[2]
    [1, 2, [3, 4, [...]]]
    >>> c[2] is d
    1
    >>> d[2] is c
    1

Note that if you pickle each object separately, instead of pickling all objects together in a tuple, you will get slightly different (but important) results, as shown in Listing 7:

Listing 7. Pickle separately vs. pickle together in a tuple

    >>> f = file('temp.pkl', 'w')
    >>> pickle.dump(a, f)
    >>> pickle.dump(b, f)
    >>> f.close()
    >>> f = file('temp.pkl', 'r')
    >>> c = pickle.load(f)
    >>> d = pickle.load(f)
    >>> f.close()
    >>> c
    [1, 2, [3, 4, [...]]]
    >>> d
    [3, 4, [1, 2, [...]]]
    >>> c[2]
    [3, 4, [1, 2, [...]]]
    >>> d[2]
    [1, 2, [3, 4, [...]]]
    >>> c[2] is d
    0
    >>> d[2] is c
    0

Equal, but not always the same

As implied in the previous example, these objects are only the same if they refer to the same object in memory. In the case of pickle, each object is restored to an object equal to the original object, but not the same object. In other words, each pickle is a copy of the original object:

Listing 8. The restored object as a copy of the original object

    >>> j = [1, 2, 3]
    >>> k = j
    >>> k is j
    1
    >>> x = pickle.dumps(k)
    >>> y = pickle.loads(x)
    >>> y
    [1, 2, 3]
    >>> y == k
    1
    >>> y is k
    0
    >>> y is j
    0
    >>> k is j
    1

We have seen that Python can maintain references between objects when those objects are pickled as a unit. However, we have also seen that separate calls to dump() prevent Python from maintaining references to objects pickled outside that unit. Instead, Python makes a copy of the referenced object and stores the copy with the pickled object. For applications that pickle and restore a single object hierarchy, this is not a problem, but it is something to be aware of in other situations.

It is worth pointing out that there is an option that does allow objects to be pickled separately while maintaining references to each other, as long as the objects are all pickled to the same file. The pickle and cPickle modules provide a Pickler (and a corresponding Unpickler) that can keep track of the objects that have already been pickled. With this Pickler, shared and circular references are pickled by reference rather than by value:

Listing 9. Maintaining references between separately pickled objects

    >>> f = file('temp.pkl', 'w')
    >>> pickler = pickle.Pickler(f)
    >>> pickler.dump(a)
    <cPickle.Pickler object at 0x89b0bb8>
    >>> pickler.dump(b)
    <cPickle.Pickler object at 0x89b0bb8>
    >>> f.close()
    >>> f = file('temp.pkl', 'r')
    >>> unpickler = pickle.Unpickler(f)
    >>> c = unpickler.load()
    >>> d = unpickler.load()
    >>> c[2]
    [3, 4, [1, 2, [...]]]
    >>> d[2]
    [1, 2, [3, 4, [...]]]
    >>> c[2] is d
    1
    >>> d[2] is c
    1

Unpickleable objects

Some object types cannot be pickled. For example, Python cannot pickle a file object (or any object holding a reference to a file object), because Python cannot guarantee that it can recreate the state of the file upon unpickling. (Other examples are more obscure and not worth mentioning in an article of this kind.) Attempting to pickle a file object results in the following error:

Listing 10. The result of trying to pickle a file object

    >>> f = file('temp.pkl', 'w')
    >>> p = pickle.dumps(f)
    Traceback (most recent call last):
      File "<input>", line 1, in ?
      File "/usr/lib/python2.2/copy_reg.py", line 57, in _reduce
        raise TypeError, "can't pickle %s objects" % base.__name__
    TypeError: can't pickle file objects
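Code that may be handed unpickleable objects can guard against this failure. The sketch below assumes Python 3, where the built-in open() replaces file() and the attempt still raises TypeError, although the wording of the message differs between versions. The temporary path is created only for the demonstration.

```python
import os
import pickle
import tempfile

# A throwaway file, used only to obtain an open file object.
path = os.path.join(tempfile.mkdtemp(), "temp.pkl")

error = None
with open(path, "w") as f:
    try:
        pickle.dumps(f)          # file objects cannot be pickled
    except TypeError as exc:
        error = exc              # keep the exception for inspection

print("pickling failed:", error)
```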

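Class instances

Pickling class instances takes a little more care than pickling simple object types. This is mainly because Python pickles the instance data (usually the __dict__ attribute) and the name of the class, but not the code of the class. When Python unpickles a class instance, it tries to import the module containing the class definition, using the exact class name and module name that were recorded when the instance was pickled.

When a class instance is unpickled, its __init__() method is normally not called again. Instead, Python creates a generic instance, applies the pickled instance attributes to it, and sets the instance's __class__ attribute to point to the original class.

The new-style classes introduced in Python 2.2 are unpickled by a slightly different mechanism. The end result is essentially the same, but Python uses the _reconstructor() function of the copy_reg module to restore instances of new-style classes.

If you want to change the default pickling behavior, a class can define the special methods __getstate__() and __setstate__(), which Python calls when it saves and restores the state of the class's instances.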
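One point worth checking for yourself is that unpickling restores an instance without running __init__() again. The sketch below assumes Python 3 and counts constructor calls on a class attribute; Tracked is a hypothetical class made up for this illustration.

```python
import pickle


class Tracked:
    """Hypothetical class that counts how often __init__() runs."""

    init_calls = 0

    def __init__(self, value):
        Tracked.init_calls += 1
        self.value = value


original = Tracked(10)                       # __init__() runs once here
restored = pickle.loads(pickle.dumps(original))

# The attributes come back from the pickled state, not from __init__().
print(restored.value, Tracked.init_calls)
```

This is why any setup work that must happen on restore belongs in __setstate__() rather than in __init__().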

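Now let's look at a simple class instance. The persist.py Python module contains the following definition of a new-style class:

Listing 11. New-style class definition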

    class Foo(object):
        def __init__(self, value):
            self.value = value

Now we can pickle a Foo instance and look at its representation:

Listing 12. Pickling a Foo instance

    >>> import cPickle as pickle
    >>> from Orbtech.examples.persist import Foo
    >>> foo = Foo('What is a Foo?')
    >>> p = pickle.dumps(foo)
    >>> print p
    ccopy_reg
    _reconstructor
    p1
    (cOrbtech.examples.persist
    Foo
    p2
    c__builtin__
    object
    p3
    NtRp4
    (dp5
    S'value'
    p6
    S'What is a Foo?'
    sb.

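As you can see, both the class name, Foo, and the fully qualified module name, Orbtech.examples.persist, are stored in the pickle. If this instance is pickled to a file and unpickled later, or unpickled on another machine, Python will try to import the Orbtech.examples.persist module and will raise an exception if it cannot.

Here is the error Python raises when the Foo class has been renamed and we try to load a previously pickled Foo instance:

Listing 13. Trying to load a pickled instance of a renamed Foo class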

    >>> import cPickle as pickle
    >>> f = file('temp.pkl', 'r')
    >>> foo = pickle.load(f)
    Traceback (most recent call last):
      File "<input>", line 1, in ?
    AttributeError: 'module' object has no attribute 'Foo'

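A similar error occurs after the persist.py module has been renamed:

Listing 14. Trying to load a pickled instance after the persist.py module was renamed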

    >>> import cPickle as pickle
    >>> f = file('temp.pkl', 'r')
    >>> foo = pickle.load(f)
    Traceback (most recent call last):
      File "<input>", line 1, in ?
    ImportError: No module named persist

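Later in this article we will look at techniques for managing these kinds of changes without breaking existing pickles.

Special state methods

As mentioned earlier, some object types, such as file objects, cannot be pickled. Instance attributes that hold such unpickleable objects can be handled with the special methods __getstate__() and __setstate__(), which modify the state that is saved for a class instance. Here is a version of the Foo class modified to handle a file object attribute:

Listing 15. Handling unpickleable instance attributes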

    class Foo(object):

        def __init__(self, value, filename):
            self.value = value
            self.logfile = file(filename, 'w')

        def __getstate__(self):
            """Return state values to be pickled."""
            f = self.logfile
            return (self.value, f.name, f.tell())

        def __setstate__(self, state):
            """Restore state from the unpickled state values."""
            self.value, name, position = state
            f = file(name, 'w')
            f.seek(position)
            self.logfile = f

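When a Foo instance is pickled, Python pickles only the values returned by the instance's __getstate__() method. Likewise, on unpickling, Python passes the unpickled values as an argument to the instance's __setstate__() method. Inside __setstate__(), we can recreate the file object from the pickled name and position information and assign it to the instance's logfile attribute.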
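A runnable variant of Listing 15 might look like the following. It is a sketch under a few assumptions: it targets Python 3, so open() replaces file(); it opens the log in binary mode so that tell() and seek() work with plain byte offsets; and it reopens the file with mode 'r+b' rather than 'w' so that earlier log entries are kept instead of truncated. The temporary path and the log messages are made up for the demonstration.

```python
import os
import pickle
import tempfile


class Foo:
    def __init__(self, value, filename):
        self.value = value
        self.logfile = open(filename, "wb")

    def __getstate__(self):
        """Return state values to be pickled (never the file object itself)."""
        f = self.logfile
        f.flush()  # make sure tell() reflects everything written so far
        return (self.value, f.name, f.tell())

    def __setstate__(self, state):
        """Recreate the file object from the pickled name and position."""
        self.value, name, position = state
        f = open(name, "r+b")  # assumption: keep existing contents
        f.seek(position)
        self.logfile = f


path = os.path.join(tempfile.mkdtemp(), "foo.log")

foo = Foo("What is a Foo?", path)
foo.logfile.write(b"first entry\n")
data = pickle.dumps(foo)        # calls __getstate__()
foo.logfile.close()

restored = pickle.loads(data)   # calls __setstate__()
restored.logfile.write(b"second entry\n")
restored.logfile.close()

with open(path, "rb") as f:
    log_contents = f.read()
print(log_contents)
```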

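Over time, you will find yourself having to change class definitions. If instances of a class have already been pickled and the class now needs to change, you will probably want to retrieve and update those instances so that they keep working with the new definition. We have already seen some of the errors that occur when classes or modules change. Fortunately, the pickling and unpickling processes provide hooks that can be used to support this kind of change.

Because the code of a class is not pickled, you can add, change, and remove methods without affecting existing pickled instances. You must, however, make sure that the module containing the class definition is available in the unpickling environment, and you must plan for the changes that can cause unpickling problems: changing a class name, adding or removing instance attributes, and changing the name or location of the module that defines the class.

Class name changes

To change the name of a class without breaking previously pickled instances, first leave the original class definition in place so that it can be found when existing instances are unpickled. Instead of renaming it, create a copy of the class definition in the same module and give the copy the new class name. Then add the following method to the original class definition, using the actual new class name in place of NewClassName:

Listing 16. Adding a __setstate__() method for a class name change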

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.__class__ = NewClassName
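To see the __setstate__() method above at work, the following sketch pickles an instance of an old class and gets back an instance of the new one. It assumes Python 3; OldPerson and NewPerson are hypothetical names chosen for the illustration, with NewPerson playing the role of NewClassName.

```python
import pickle


class NewPerson:
    """The renamed class that existing pickles should migrate to."""

    def __init__(self, name):
        self.name = name


class OldPerson:
    """The original class, kept so that old pickles can still find it."""

    def __init__(self, name):
        self.name = name

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.__class__ = NewPerson   # hand the instance over to the new class


old_pickle = pickle.dumps(OldPerson("Ada"))   # stands in for an existing pickle
restored = pickle.loads(old_pickle)

print(type(restored).__name__, restored.name)
```

Re-pickling restored now records the new class name, which is what eventually allows the old definition to be deleted.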

When an existing instance is unpickled, Python looks up the definition of the original class, calls the instance's __setstate__() method, and in doing so reassigns the instance's __class__ attribute to the new class definition. Once you are sure that all existing instances have been unpickled, updated, and re-pickled, you can remove the old class definition from the source module.

Attribute addition and deletion

The special state methods __getstate__() and __setstate__() once again give us control over the state of each instance, and with it the opportunity to handle changes in instance attributes. Let's look at a simple class definition to which we will add and remove some attributes. This is the original definition:

Listing 17. Initial class definition

    class Person(object):

        def __init__(self, firstname, lastname):
            self.firstname = firstname
            self.lastname = lastname

Assuming that an instance of Person has been created and pickled, we now decide that we really want to store only a name attribute instead of storing the last name and first name separately. Here is a way to change the definition of a class, which migrates the previously pickled instance to the new definition:

Listing 18. New class definition

    class Person(object):

        def __init__(self, fullname):
            self.fullname = fullname

        def __setstate__(self, state):
            if 'fullname' not in state:
                first = ''
                last = ''
                if 'firstname' in state:
                    first = state['firstname']
                    del state['firstname']
                if 'lastname' in state:
                    last = state['lastname']
                    del state['lastname']
                # Join with a space so the two names stay separated.
                self.fullname = " ".join([first, last]).strip()
            self.__dict__.update(state)
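The migration in Listing 18 can be tried end to end in a single script. The sketch below assumes Python 3 and simulates editing the source file by defining Person twice: the first definition produces the old pickle, and the second one, which rebinds the same name, is what the unpickler finds later. The sample name is made up, and dict.pop() is used as a shorter equivalent of the lookup-then-delete steps in the listing.

```python
import pickle


class Person:
    """Original definition: separate first and last names."""

    def __init__(self, firstname, lastname):
        self.firstname = firstname
        self.lastname = lastname


old_pickle = pickle.dumps(Person("Ada", "Lovelace"))


class Person:  # rebinding the name stands in for editing the module
    """New definition: a single fullname attribute."""

    def __init__(self, fullname):
        self.fullname = fullname

    def __setstate__(self, state):
        if "fullname" not in state:
            # Old-style state: merge the two names and drop the old keys.
            first = state.pop("firstname", "")
            last = state.pop("lastname", "")
            self.fullname = " ".join([first, last]).strip()
        self.__dict__.update(state)


migrated = pickle.loads(old_pickle)   # looks up Person by name at load time
print(migrated.fullname, sorted(vars(migrated)))
```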

In this example, we added a new attribute, fullname, and removed two existing attributes, firstname and lastname. When a previously pickled instance is unpickled, its pickled state is passed to __setstate__() as a dictionary that includes the values of the firstname and lastname attributes. We combine these two values and assign the result to the new fullname attribute, deleting the old attributes from the state dictionary along the way. Once all previously pickled instances have been updated and re-pickled, the __setstate__() method can be removed from the class definition.

Modification of the module

Conceptually, a change to a module's name or location is similar to a change to a class name, but it has to be handled quite differently. That is because the module information is stored in the pickle itself and is not an attribute that can be modified through the standard pickle interface. With the functions covered in this article, the only way to change the module information is to perform a find-and-replace operation on the actual pickle file. Exactly how to do this depends on your operating system and the tools available. Obviously, this is a situation in which you will want to back up your files first to guard against mistakes. The change itself should be fairly straightforward, though, and it should work just as well on the binary pickle format as on the text pickle format.
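Where editing pickle files by hand is undesirable, Python 3's pickle module offers a different route that goes beyond the functions covered in this article: an Unpickler subclass can override find_class() and translate old module names as pickles are loaded. The sketch below is only an illustration of that idea. The module names old_persist and new_persist, the MOVED_MODULES table, and the RenamingUnpickler class are all hypothetical, and the two modules are faked with types.ModuleType so the example fits in one file.

```python
import io
import pickle
import sys
import types


class Person:
    def __init__(self, fullname):
        self.fullname = fullname


# Pretend Person originally lived in a module named "old_persist".
Person.__module__ = "old_persist"
old_module = types.ModuleType("old_persist")
old_module.Person = Person
sys.modules["old_persist"] = old_module
data = pickle.dumps(Person("Ada Lovelace"))   # records "old_persist"

# Simulate renaming the module: the old name disappears, a new one appears.
del sys.modules["old_persist"]
Person.__module__ = "new_persist"
new_module = types.ModuleType("new_persist")
new_module.Person = Person
sys.modules["new_persist"] = new_module

# Hypothetical translation table from old module names to new ones.
MOVED_MODULES = {"old_persist": "new_persist"}


class RenamingUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Redirect lookups for moved modules, then defer to the default.
        return super().find_class(MOVED_MODULES.get(module, module), name)


restored = RenamingUnpickler(io.BytesIO(data)).load()
print(type(restored).__name__, restored.fullname)
```

Unlike a find-and-replace, this leaves the stored pickles untouched; re-pickling the restored objects would record the new module name.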

Reposted from CobbLiu's blog