[Translation] Highlights in Git 2.31

The open source Git project recently released Git 2.31, with new features and bug fixes from 85 contributors, 23 of them new. The last time we caught up with you, Git 2.29 had just been released. Two releases have shipped since then, so let's take a look at the most interesting features and changes.

Introducing git maintenance

Imagine this: you are in your terminal, writing commits, pulling from other repositories, and pushing the results to a remote, when all of a sudden you run into this unwelcome message:

Auto packing the repository for optimum performance. You may also
run "git gc" manually. See "git help gc" for more information.

Then you are stuck. All you can do is wait for Git to finish running git gc --auto before you can continue.

What happened here? In the course of normal use, Git writes a lot of data: objects, packfiles, references, and so on. Some of those paths are optimized for write performance. For example, writing a single "loose" object is faster, but reading from a packfile is faster.

To keep you productive, Git makes a trade-off: it generally optimizes for the write path while you work, and pauses every so often to make its internal data structures more efficient to read, so that you stay productive in the long run.

Git has its own heuristics for deciding when it is a good time to perform this "pause", but sometimes those heuristics trigger a blocking git gc at an inconvenient moment. You could manage these data structures yourself, but you may not want to spend time working out when and how to do it.

Starting with Git 2.31, background maintenance gives you the best of both worlds. This cross-platform feature keeps your repository healthy without blocking any of your interactions. In particular, Git pre-fetches the latest objects from your remotes once an hour, which noticeably shortens the time that git fetch takes.

Getting started with background maintenance couldn't be easier. In your terminal, switch to the repository where you want to enable it, and run the following command:

$ git maintenance start

Git takes care of the rest. Besides pre-fetching the latest objects once an hour, Git also keeps its own data organized. It updates its commit-graph file once an hour, and packs loose objects every night (and incrementally repacks objects that have already been packed).

In the git maintenance documentation, you can read more about this feature and learn how to customize it with the maintenance.* config options. If you run into trouble, you can consult the troubleshooting documentation.
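As a rough sketch of that customization, the commands below create a throwaway repository, set a few maintenance.* options, and run a single task in the foreground instead of registering background jobs. The repository path, the commit identity, and the particular schedule choices are illustrative assumptions, not recommendations from the original article.

```shell
set -eu

# Throwaway repository so the sketch does not touch a real project (hypothetical path).
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
echo hello > file.txt
git add file.txt
git -c user.name=Example -c user.email=example@example.com commit -q -m "first commit"

# Use the incremental strategy, which schedules the hourly and daily tasks described above.
git config maintenance.strategy incremental

# Assumed tweaks: skip the hourly prefetch and pack loose objects weekly instead.
git config maintenance.prefetch.enabled false
git config maintenance.loose-objects.schedule weekly

# Run one task right now, in the foreground, to see what it does.
git maintenance run --task=loose-objects

# The task writes the loose objects into a new packfile.
pack_count=$(find .git/objects/pack -name '*.pack' | wc -l | tr -d ' ')
echo "packfiles after maintenance: $pack_count"
```

Running a task by hand like this is a convenient way to see its effect before letting the scheduler run it in the background.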

[ Source code , source code , source code , source code ]

On-disk reverse index

As you may already know, Git stores all data as "objects": commits, trees, and the blobs that hold the contents of each file. For efficiency, Git puts many objects into a packfile, which is essentially a concatenated stream of objects (the same stream is used to transfer objects during git fetch and git push). To access these objects efficiently, Git generates an index for each packfile. These .idx files allow an object ID to be converted quickly into the corresponding byte offset within the packfile.

What about going in the other direction? In particular, if all Git knows is which byte it is looking at in a packfile, how does it figure out which object that byte belongs to?

To do this, Git uses an aptly named reverse index: an opaque mapping between locations in a packfile and the object each location belongs to. Before Git 2.31, there was no on-disk format for reverse indexes (as there is for .idx files), so the reverse index had to be generated and held in memory every time it was needed. Generating it roughly amounts to building an array of object-position pairs and then sorting that array by position (curious readers can find the exact details in Git's source code).
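To make the idea concrete, here is a toy sketch in shell of building a reverse index by sorting object-position pairs. The object names and offsets are made up for illustration, and real Git does this in C over binary pack data, so treat it as a model of the concept rather than Git's implementation.

```shell
# Hypothetical forward index (.idx-like): object ID -> byte offset, ordered by object ID.
forward_index='1a2b3c 340
4d5e6f 12
7a8b9c 97'

# Reverse index: the same pairs, sorted by offset instead of by object ID.
reverse_index=$(printf '%s\n' "$forward_index" | sort -k2,2n)
printf '%s\n' "$reverse_index"

# Given a byte position, find the object that contains it: the last entry
# whose starting offset is not greater than that position.
object_at() {
  printf '%s\n' "$reverse_index" |
    awk -v pos="$1" '$2 + 0 <= pos + 0 { id = $1 } END { print id }'
}

object_at 100   # prints 7a8b9c, the object that starts at offset 97
```

The sort is the expensive step: it has to touch every object in the pack, which is why repeating it for each process becomes costly for large packfiles.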

But generating the reverse index takes time, and if the packfiles in a repository are large, it can take a long time. To see the impact, we can run an experiment that compares how long it takes to print the size of an object with how long it takes to print its contents. To print only an object's contents, Git uses the forward index to locate the object in the packfile. To print the size of an object in the packfile, however, Git must locate not only that object but also the one immediately after it, and then subtract the two positions to find out how much space the object occupies. To find the position of the first byte of the adjacent object, Git needs the reverse index.

In this experiment, printing the size of an object is 62 times slower than printing the object's entire contents. You can try it yourself with hyperfine:

$ git rev-parse HEAD >tip
$ hyperfine --warmup=3 \
    'git cat-file --batch <tip' \
    'git cat-file --batch-check="%(objectsize:disk)" <tip'
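The subtraction described above can be sketched with toy data. The object names, the offsets, and the pack_end value (the offset just past the last object's data) are invented for the example; only the arithmetic mirrors what the text describes.

```shell
# Hypothetical reverse index: object ID and starting offset, sorted by offset.
reverse_index='4d5e6f 12
7a8b9c 97
1a2b3c 340'
pack_end=400   # assumed offset just past the last object's data

# On-disk size = (start of the next object, or pack_end for the last one) - own start.
disk_size() {
  printf '%s\n' "$reverse_index" | awk -v want="$1" -v pack_end="$pack_end" '
    found      { print $2 - start; found = 0 }
    $1 == want { start = $2; found = 1 }
    END        { if (found) print pack_end - start }'
}

disk_size 7a8b9c   # 340 - 97 = 243
disk_size 1a2b3c   # 400 - 340 = 60
```

Once the pairs are sorted by offset, finding the neighbor is cheap; the cost measured in the experiment comes from having to build that sorted array first.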

In version 2.31, Git can finally serialize the reverse index into a new on-disk format with the .rev file extension. After generating an on-disk reverse index and repeating the experiment above, printing the contents and printing the size of the same object take almost the same amount of time.

Observant readers may wonder why Git bothers with a reverse index at all. After all, if you can already print an object's contents, printing its size should be no harder than counting the bytes as you print them. However, this depends on the size of the object. If the object is very large, counting all of its bytes is far more expensive than a simple subtraction.

Reverse indexes are useful for more than synthetic experiments like the one above. For example, when objects are transferred during a fetch or push, the reverse index is used to send object bytes directly from disk. Having the reverse index computed ahead of time makes this process faster.

Git does not generate .rev files by default, but you can try them yourself: run git config pack.writeReverseIndex true, and then repack your repository (using git repack -Ad). We have been using reverse indexes at GitHub for the past few months, and they have significantly improved many Git operations.

[ Source code , source code ]

Tidbits

  • In a previous article, we talked about the commit-graph file. It is a very useful serialization of common information about commits, such as which parents each commit has and what its root tree is. (If you want more detail, there is a series of articles that explains it thoroughly.) The commit-graph also stores a generation number for each commit, which helps speed up many kinds of commit walks. Git 2.31 uses a new kind of generation number, which can further improve performance in certain scenarios. This work was contributed by Abhishek Kumar, a Google Summer of Code student.

    [ Source code ]

  • In recent versions of Git, the init.defaultBranch configuration option has made it easier to change the default name of the main branch in new repositories. When cloning, Git has always tried to check out the branch that the remote's HEAD points to (for example, if the remote's default branch is "foo", then git clone tries to check out foo locally), but this did not work for empty repositories. In Git 2.31, it works for empty repositories too. Now, if you clone a newly created repository and start writing the first code locally, your local copy follows the remote's default branch name, even though the remote has no commits yet.

    [ Source code ]

  • Speaking of renaming, Git 2.30 also makes it easier to change another default name: the name of the first remote in a repository. When you clone a repository, the initial remote is always called "origin". Before Git 2.30, if you wanted a different name, you would run git remote rename origin <newname> after cloning. Git 2.30 lets you configure a custom default name instead of always using "origin". You can try it yourself by setting the clone.defaultRemoteName configuration option.

    [ Source code ]

  • As a repository grows, it can be hard to tell which branches take up the most space. In Git 2.31, git rev-list gained a --disk-usage option, which makes computing the on-disk size of objects simpler and faster than piecing it together from existing tools. The examples section of the rev-list manual shows some use cases (and the timings in the source link below show the "traditional" way of doing this).

    [ Source code ]

  • You may have used the -G<regex> option to find commits that modify specific code (for example, git log -G'foo\(' finds commits that change calls to the foo() function, whether the calls were added, deleted, or modified). But you may also want to ignore changes that match a particular pattern. Git 2.30 introduced -I<regex>, which lets you ignore changes whose lines all match a given regular expression. For example, git log -p -I'//' leaves out changes that only touch comment lines (those containing //).

    [ Source code ]

  • To pave the way for a new merge backend, rename detection has also been significantly optimized. For more details, see the code author's articles Optimizing git's merge machinery, #1 and Optimizing git's merge machinery, #2.

That is just a sample of the changes in the latest releases. To learn more, you can read the release notes for 2.30 and 2.31, or for any earlier version, in the Git repository.
