Using Python to process videos downloaded from Bilibili (Station B)

Bilibili (also known as Station B) is an ACG-related danmaku (bullet comment) video sharing website launched in June 2009, and an entertainment community built around youth trends and culture. Readers who have heard of it but rarely visit probably associate Station B with ACG (two-dimensional) culture, animation, and danmaku. But as one of China's best-known danmaku video sites, Station B is not limited to animation; it also hosts a wealth of learning resources.

I often search Station B for videos on artificial intelligence and machine learning, download them on my phone, and watch them offline. To watch them on a computer, I used to convert the cached videos with apps such as "Video Merge Assistant" and then copy them over. During the Spring Festival holiday I downloaded some videos again and wanted to move them to my computer, only to find that the conversion app no longer worked and could not find the videos that the Bilibili app had downloaded to the phone. That is how the video merging work described below began.

The basic idea

  • Purpose: Merge the files that the Bilibili app caches on the phone and convert them to MP4 format

  • The basic idea:

    1. Analyze the cache directory structure and the cached files
    2. Merge the files with a library

  • Development environment:

    1. Phone: Huawei Mate 20 X, EMUI 10, Bilibili app version 5.53.1
    2. Development environment: MacBook Pro 2015, Python 3.7.6 64-bit, Visual Studio Code 1.41.1

0x00 Bilibili APP cache file directory structure and file analysis

Open the phone's file manager and find the Android/data/tv.danmaku.bili/down folder. Its structure is shown in the figure below:

app directory

Each folder named with an 8-digit number stores a single video album. Its first-level subdirectories are numbered incrementally from 1, and each one holds the cached files for one part of the album (think of it as one episode). The second-level subdirectories are all named 16, the same value that appears as type_tag in the entry.json file shown later, so the layout is quite regular.
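
The layout described above can be explored with a few lines of Python. This is a minimal sketch rather than part of the final script: list_episodes is a hypothetical helper, and it assumes that episode folders have purely numeric names and that every folder inside an episode is a quality folder such as 16.

```python
import os
import tempfile


def list_episodes(album_dir):
    """Return (episode_number, quality_folder_names) pairs, sorted by episode number."""
    episodes = []
    for name in os.listdir(album_dir):
        path = os.path.join(album_dir, name)
        # Assumption: episode folders are named 1, 2, 3, ...
        if os.path.isdir(path) and name.isdigit():
            qualities = sorted(
                sub for sub in os.listdir(path)
                if os.path.isdir(os.path.join(path, sub))
            )
            episodes.append((int(name), qualities))
    # Sort numerically so that episode 10 comes after episode 2
    return sorted(episodes)


if __name__ == "__main__":
    # Build a throwaway album that mimics the cache layout, then list it
    with tempfile.TemporaryDirectory() as album:
        for episode in ("1", "2", "10"):
            os.makedirs(os.path.join(album, episode, "16"))
        for number, qualities in list_episodes(album):
            print(number, qualities)
```

Sorting by the integer value matters because os.listdir returns names in arbitrary order, and a plain string sort would place 10 before 2.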

There are the following files under each video album:

1. The first-level subdirectory contains danmaku.xml and entry.json. danmaku.xml is the danmaku (bullet comment) file:

    <?xml version="1.0" encoding="UTF-8"?>
    <i>
        <chatserver></chatserver>
        <chatid>132379211</chatid>
        <mission>0</mission>
        <maxlimit>3000</maxlimit>
        <state>0</state>
        <real_name>0</real_name>
        <source>kv</source>
        <d p="22.23400,1,25,16777215,1575199941,0,aaaeeaeb,25196110486700034">Ground air</d>
        <d p="1318.88600,1,25,16777215,1578391805,0,48b91c28,26869566679810052">The specified version cannot be installed, only the latest version</d>
        <d p="582.62400,1,25,16777215,1578964914,0,15eedcf5,27170040630476802">nice</d>
        <d p="26.29000,1,25,16777215,1579009720,0,c1d89d8e,27193531775320068">Hahaha indeed</d>
    </i>
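
As a quick check of the danmaku file, the comments can be read with Python's standard xml.etree.ElementTree module. This is only a sketch: parse_danmaku is a hypothetical helper, and treating the first value of the p attribute as the playback offset in seconds is an assumption, since the cache files do not document the fields.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the danmaku.xml content shown above
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<i>
    <chatid>132379211</chatid>
    <d p="582.62400,1,25,16777215,1578964914,0,15eedcf5,27170040630476802">nice</d>
    <d p="22.23400,1,25,16777215,1575199941,0,aaaeeaeb,25196110486700034">Ground air</d>
</i>"""


def parse_danmaku(xml_bytes):
    """Return (offset_seconds, text) pairs sorted by offset."""
    root = ET.fromstring(xml_bytes)
    comments = []
    for d in root.iter("d"):
        fields = d.get("p", "").split(",")
        if not fields[0]:
            continue  # skip comments without a usable p attribute
        # Assumption: the first field of p is the playback offset in seconds
        comments.append((float(fields[0]), d.text or ""))
    return sorted(comments)


if __name__ == "__main__":
    for offset, text in parse_danmaku(SAMPLE.encode("utf-8")):
        print(f"{offset:8.2f}s  {text}")
```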

The entry.json file describes the cached video:

    {
        "media_type": 2,
        "has_dash_audio": true,
        "is_completed": true,
        "total_bytes": 21176174,
        "downloaded_bytes": 21176174,
        "title": "(Full) Python-based Opencv project actual combat",
        "type_tag": "16",
        "cover": "http:\/\/\/bfs\/archive\/afae181e4bb00d7ca2e97f192e6f11dc2c3d8142.jpg",
        "prefered_video_quality": 16,
        "guessed_total_bytes": 0,
        "total_time_milli": 1152336,
        "danmaku_count": 0,
        "time_update_stamp": 1580398289030,
        "time_create_stamp": 1580348758458,
        "avid": 77390697,
        "spid": 0,
        "seasion_id": 0,
        "bvid": "",
        "page_data": {
            "cid": 132379572,
            "page": 6,
            "from": "vupload",
            "part": "06. Edge detection",
            "link": "",
            "rich_vid": "",
            "vid": "",
            "has_alias": false,
            "weblink": "",
            "offsite": "",
            "tid": 39,
            "width": 960,
            "height": 540,
            "rotate": 0,
            "download_title": "Video has been cached",
            "download_subtitle": "(All) Python-based Opencv project combat 06, edge detection"
        }
    }

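We need a field from this JSON to name the output file. "download_subtitle" holds the full title, while the batch script later in this article reads the shorter "part" field under "page_data".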
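
Reading a name out of entry.json takes only the standard json module. In the sketch below, output_name is a hypothetical helper, the sample is a shortened copy of the file above, and the fallback order (download_subtitle, then part, then title) is an assumption rather than anything the app specifies.

```python
import json

# Shortened copy of the entry.json content shown above
SAMPLE_ENTRY = """
{
    "title": "(Full) Python-based Opencv project actual combat",
    "type_tag": "16",
    "page_data": {
        "page": 6,
        "part": "06. Edge detection",
        "download_subtitle": "(All) Python-based Opencv project combat 06, edge detection"
    }
}
"""


def output_name(entry):
    """Pick a name for the merged file from a parsed entry.json dictionary."""
    page = entry.get("page_data", {})
    # Prefer the full subtitle; fall back to the episode name, then the album title
    return page.get("download_subtitle") or page.get("part") or entry.get("title", "untitled")


if __name__ == "__main__":
    print(output_name(json.loads(SAMPLE_ENTRY)))
```
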
2. The second-level subdirectory named "16" contains 3 files. Judging by their names, audio.m4s and video.m4s are the cached audio and video. Opening each of them in a player confirms this: both play, but the video has no sound. In other words, Station B stores a video's audio and video streams separately.

The remaining file is index.json:

    {
        "video": [{
            "id": 16,
            "base_url": "https:\/\/\/upgcxcode\/11\/92\/132379211\/132379211-1-30015.m4s?e=ig8euxZM2rNcNbdlhoNvNC8BqJIzNbfqXBvEuENvNC8aNEVEtEvE9IMvXBvE2ENvNCImNEVEIj0Y2J_aug859r1qXg8xNEVE5XREto8GuFGv2U7SuxI72X6fTr859IB_&uipk=5&nbs=1&deadline=1580403879&gen=playurl&os=hwbv&oi=611071466&trid=69dea77b6c6a4049a6f88615e82ada05u&platform=android&upsig=1e3e3eedc71aba8b50ce51e67f3ca508&uparams=e,uipk,nbs,deadline,gen,os,oi,trid,platform&mid=280178137",
            "backup_url": ["https:\/\/\/?e=ig8euxZM2rNcNbdlhoNvNC8BqJIzNbfqXBvEuENvNC8aNEVEtEvE9IMvXBvE2ENvNCImNEVEIj0Y2J_aug859r1qXg8xNEVE5XREto8GuFGv2U7SuxI72X6fTr859IB_&uipk=5&nbs=1&deadline=1580403879&gen=playurl&os=ks3bv&oi=611071466&trid=69dea77b6c6a4049a6f88615e82ada05u&platform=android&upsig=6de0ffa1809d46318ca36387ca8d8634&uparams=e,uipk,nbs,deadline,gen,os,oi,trid,platform&mid=280178137"],
            "bandwidth": 104293,
            "codecid": 7,
            "size": 19069578,
            "md5": "eab8c79d8ab56a973626a20e1dee6c25"
        }],
        "audio": [{
            "id": 30216,
            "base_url": "https:\/\/\/upgcxcode\/11\/92\/132379211\/132379211-1-30216.m4s?e=ig8euxZM2rNcNbdlhoNvNC8BqJIzNbfqXBvEuENvNC8aNEVEtEvE9IMvXBvE2ENvNCImNEVEIj0Y2J_aug859r1qXg8xNEVE5XREto8GuFGv2U7SuxI72X6fTr859IB_&uipk=5&nbs=1&deadline=1580403879&gen=playurl&os=kodobv&oi=611071466&trid=69dea77b6c6a4049a6f88615e82ada05u&platform=android&upsig=59e394e791a6ccb7c32b7d2eb1f0957d&uparams=e,uipk,nbs,deadline,gen,os,oi,trid,platform&mid=280178137",
            "backup_url": ["https:\/\/\/11\/92\/132379211\/132379211-1-30216.m4s?e=ig8euxZM2rNcNbdlhoNvNC8BqJIzNbfqXBvEuENvNC8aNEVEtEvE9IMvXBvE2ENvNCImNEVEIj0Y2J_aug859r1qXg8xNEVE5XREto8GuFGv2U7SuxI72X6fTr859IB_&uipk=5&nbs=1&deadline=1580403879&gen=playurl&os=ks3bv&oi=611071466&trid=69dea77b6c6a4049a6f88615e82ada05u&platform=android&upsig=856c6fed7b7f7c1a4967a8e40cf8fc59&uparams=e,uipk,nbs,deadline,gen,os,oi,trid,platform&mid=280178137"],
            "bandwidth": 67113,
            "codecid": 0,
            "size": 12272062,
            "md5": "5d7a6a8e6f4c2809ac61eeafa1d9eaae"
        }]
    }

This json file contains information about audio files and video files.
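
One possible use of index.json is a sanity check before merging. The sketch below compares the recorded sizes with the files on disk. check_sizes is a hypothetical helper, and it rests on two assumptions that the cache files do not confirm: that the first entry of each list describes video.m4s or audio.m4s in the same folder, and that "size" is the file length in bytes.

```python
import json
import os
import tempfile


def check_sizes(quality_dir):
    """Report whether video.m4s and audio.m4s match the sizes recorded in index.json."""
    with open(os.path.join(quality_dir, "index.json"), "r", encoding="utf-8") as f:
        index = json.load(f)
    result = {}
    for kind in ("video", "audio"):
        # Assumption: the first entry in each list describes <kind>.m4s,
        # and its "size" is that file's length in bytes
        expected = index[kind][0]["size"]
        actual = os.path.getsize(os.path.join(quality_dir, kind + ".m4s"))
        result[kind] = expected == actual
    return result


if __name__ == "__main__":
    # Demonstrate on a throwaway folder instead of a real cache directory
    with tempfile.TemporaryDirectory() as folder:
        for kind, payload in (("video", b"v" * 8), ("audio", b"a" * 3)):
            with open(os.path.join(folder, kind + ".m4s"), "wb") as f:
                f.write(payload)
        with open(os.path.join(folder, "index.json"), "w", encoding="utf-8") as f:
            json.dump({"video": [{"size": 8}], "audio": [{"size": 3}]}, f)
        print(check_sizes(folder))
```

A mismatch would suggest an incomplete download, in which case merging that episode is probably not worth attempting.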

0x01 Merging the audio and video files

From the analysis above we have located the cached files of a single album. The next step is to merge the audio track into the video, for which we use the MoviePy library. MoviePy is a Python module for video editing. It supports basic operations (such as cutting, concatenation, and title insertion), video compositing, and video processing, and can also be used to create custom advanced effects. It reads and writes most common video formats, even GIF. For detailed instructions, refer to the MoviePy Chinese documentation and the official documentation.
First, install the MoviePy library. It can be installed directly with pip, which also downloads and configures dependencies such as numpy automatically. The code in this article uses the MoviePy 1.x API (the moviepy.editor module and set_audio); MoviePy 2.x removed moviepy.editor and renamed set_audio to with_audio, so with a 2.x installation the import and that call need to be adapted.

    pip install moviepy

Once installed, the library is ready to use. Let's try it on a single file first:

    from moviepy.editor import VideoFileClip, AudioFileClip  # import the clip classes from moviepy's editor package

    audioFile = r"/Users/airwolf/Desktop/81427329/1/16/audio.m4s"  # audio file to read
    videoFile = r"/Users/airwolf/Desktop/81427329/1/16/video.m4s"  # video file to read
    outputfile = r"/Users/airwolf/Desktop/81427329/output.mp4"  # output file

    video_in = VideoFileClip(videoFile)  # read the video file
    audio_in = AudioFileClip(audioFile)  # read the audio file
    video_out = video_in.set_audio(audio_in)  # video_out is video_in with audio_in attached as its audio track
    video_out.write_videofile(outputfile)  # write video_out to the output file

At this point, opening the output file output.mp4 in a player shows that the audio has been merged into the video file.

Next comes batch merging. First, wrap the audio and video merging in a function:

    import os
    from moviepy.editor import VideoFileClip, AudioFileClip

    def set_audio(proc_file, output_path):
        (file_name, audio_file, video_file) = proc_file
        # Replace or remove characters in the name that would interfere with file naming
        file_name = file_name.replace('.', '-').replace('"', '')
        original_video = VideoFileClip(video_file)
        audio = AudioFileClip(audio_file)
        video = original_video.set_audio(audio)
        outputfile = os.path.join(output_path, file_name) + ".mp4"  # build the output file name
        video.write_videofile(outputfile)
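
The replace calls above handle the characters that appeared in this album's titles. Other albums may need a broader filter. The following is a sketch under that assumption and not part of the original script: safe_file_name is a hypothetical helper, and the character set is the one that Windows forbids in file names, chosen so the output can be copied to most common file systems.

```python
import re

# Characters that Windows does not allow in file names (this includes both path separators)
_UNSAFE = re.compile(r'[\\/:*?"<>|]')


def safe_file_name(title, replacement="-"):
    """Return a version of title that is safe to use as a file name."""
    name = _UNSAFE.sub(replacement, title)
    # Replace dots as well, so the only dot left is the one before the extension
    name = name.replace(".", replacement)
    name = name.strip()
    return name or "untitled"


if __name__ == "__main__":
    print(safe_file_name("06. Edge detection"))
    print(safe_file_name('What is "OpenCV"? Part 1/2'))
```

Such a helper could stand in for the replace chain inside set_audio without changing anything else in the function.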

The function takes two parameters. proc_file describes the file to process, passed as a list of the form [file name, audio file name, video file name], and output_path is the output path of the merged MP4 file. Next, we traverse all the subdirectories of the video album and collect the videos to process in the proc_fileList list:

    import os
    import json

    proc_fileList = []

    def get_proList(init_path):
        folder = os.listdir(init_path)
        for subfolder in folder:
            name_path = os.path.join(init_path, subfolder)
            json_file = os.path.join(name_path, "entry.json")
            if os.path.exists(json_file):
                file_info = []  # file info in the form [file name, audio file name, video file name]
                with open(json_file, 'r', encoding='utf-8') as f:  # extract the file name from the JSON file
                    data = json.load(f)
                    file_name = data["page_data"]["part"]
                    file_info.append(file_name)
                a_filename = os.path.join(name_path, "16/audio.m4s")
                v_filename = os.path.join(name_path, "16/video.m4s")
                file_info.append(a_filename)
                file_info.append(v_filename)
                proc_fileList.append(file_info)  # add the file to process to proc_fileList

The function's input parameter init_path is the first-level directory to process, that is, the folder named with an 8-digit number mentioned above.
Then write the main function:

    import sys

    if __name__ == "__main__":
        init_path = sys.argv[1]
        get_proList(init_path)
        for proc_file in proc_fileList:
            print(proc_file)
            set_audio(proc_file, init_path)

Finally, save the script as a Python file.
To use it, open a terminal and run the following command to convert the videos:

    python [script file] [path of the folder to process]

About the Author:

Airwolf, a programmer working outside the IT industry and a national embedded system designer. I started programming by teaching myself BASIC in the 6th grade of elementary school. I love programming and like to solve problems in work and life with practical, concise programs.