Python crawling of message board messages (2): multi-threaded version + selenium simulation

1. Project overview

This project captures the full content of all messages on the leaders' message board at liuyan.people.com.cn/home?p=0, extracting and saving the message details, reply details and evaluation details for subsequent data analysis and further processing, which can provide a basis for government decision-making and the implementation of e-government. For the project description and environment configuration, please refer to the first article in this series, Python crawling message board messages (1): single-process version + selenium simulation. This article makes the following improvements on top of the first one:

  1. Multi-threading is used, with the number of threads running at the same time limited to 3. This moderate thread count keeps several crawls running in parallel while avoiding the memory, CPU and network-bandwidth pressure that too many threads would cause. It is the main improvement of this version and greatly reduces the overall running time (a minimal sketch of the thread-limiting pattern follows this list).
  2. Exception handling has been optimized. Previously it sat inside the function that collects all the message links of one leader, so it only caught the timeout thrown when the "load more" button was unavailable; exceptions raised elsewhere, for example while fetching message details, were missed. It has now been moved into the main function and wrapped around each leader, so the whole crawl of a leader is covered and an error in any step is caught. In addition, 5 levels of nested exception handling increase the tolerance for failures: if the network is poor and a page cannot load, if memory pressure makes the browser hang, or if the crawled site blocks a request, the affected officials can be re-crawled, which keeps the data complete and the program robust.
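To make the thread-limiting idea concrete, here is a minimal, self-contained sketch (the names `worker` and `task_id` are illustrative only, not functions from this project) showing how a `threading.Semaphore(3)` acquired as a context manager inside each worker keeps at most three workers running at once:

```python
import threading
import time

sem = threading.Semaphore(3)  # at most 3 workers may hold the semaphore at once

def worker(task_id):
    # The context manager acquires the semaphore on entry and releases it on exit,
    # so no more than 3 workers execute this block at the same time.
    with sem:
        print('worker', task_id, 'running')
        time.sleep(1)  # stand-in for the actual crawling work

threads = [threading.Thread(target=worker, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

This is the same pattern used by the first approach described in step 9 below.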

2. Project implementation

Since three commonly used ways of achieving multithreading were tried, there are also three different concrete implementations. The first one is selected for description here:

1. Import the required libraries

```python
import csv
import os
import random
import re
import time
import threading
import dateutil.parser as dparser
from random import choice
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.options import Options
```

These are the processing libraries needed during crawling and the Selenium classes that will be used.

2. Global variables and parameter configuration

```python
# Time node: only messages after this date are crawled
start_date = dparser.parse('2019-06-01')
# Limit the number of threads running at the same time to 3
sem = threading.Semaphore(3)
# Browser options: disable image loading to save bandwidth and speed up page loads
chrome_options = Options()
chrome_options.add_argument('blink-settings=imagesEnabled=false')
```

We only crawl messages posted after 2019-06-01, because earlier messages were given automatic default praise and have no reference value, so a time node is set. The number of threads running at the same time is set globally to 3, and image loading is disabled in the browser, which reduces bandwidth requirements and speeds up page loading.

3. Generate random time and user agent

```python
def get_time():
    '''Get a random delay'''
    return round(random.uniform(3, 6), 1)


def get_user_agent():
    '''Get a random user agent'''
    user_agents = [
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
        "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
        "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
        "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
        "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
        "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
        "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
        "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
        "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
        "Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5",
        "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre",
        "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
        "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
        "MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
        "Mozilla/5.0 (iPhone; CPU iPhone OS 9_1 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Version/9.0 Mobile/13B143 Safari/601.1",
        "Mozilla/5.0 (Linux; Android 5.1.1; Nexus 6 Build/LYZ28E) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.23 Mobile Safari/537.36",
        "Mozilla/5.0 (iPod; U; CPU iPhone OS 2_1 like Mac OS X; ja-jp) AppleWebKit/525.18.1 (KHTML, like Gecko) Version/3.1.1 Mobile/5F137 Safari/525.20",
        "Mozilla/5.0 (Linux;u;Android 4.2.2;zh-cn;) AppleWebKit/534.46 (KHTML,like Gecko) Version/5.1 Mobile Safari/10600.6.3 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
        "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
    ]
    # Randomly choose one agent from the list to simulate a browser
    user_agent = choice(user_agents)
    return user_agent
```

A random delay and a random user agent are generated for each page visit, reducing the chance that the server identifies the program as a crawler and bans it.

4. Get the leader's fid

```python
def get_fid():
    '''Get all leader fids'''
    with open('url_fid.txt', 'r') as f:
        content = f.read()
    fids = content.split()
    return fids
```

Each leader has a distinguishing fid. The fids are collected manually and saved in a txt file, which is read in when crawling starts.

5. Get all the message links of leaders

```python
def get_detail_urls(position, list_url):
    '''Get all the message links of one leader'''
    user_agent = get_user_agent()
    chrome_options.add_argument('user-agent=%s' % user_agent)
    drivertemp = webdriver.Chrome(options=chrome_options)
    drivertemp.maximize_window()
    drivertemp.get(list_url)
    time.sleep(2)
    # Click "load more" in a loop until the time node is reached
    try:
        while WebDriverWait(drivertemp, 50, 2).until(EC.element_to_be_clickable((By.ID, "show_more"))):
            datestr = WebDriverWait(drivertemp, 10).until(
                lambda driver: driver.find_element_by_xpath(
                    '//*[@id="list_content"]/li[position()=last()]/h3/span')).text.strip()
            datestr = re.search(r'\d{4}-\d{2}-\d{2}', datestr).group()
            date = dparser.parse(datestr, fuzzy=True)
            print('Crawling links--', position, '--', date)
            if date < start_date:
                break
            # Simulate a click to load more messages
            drivertemp.find_element_by_xpath('//*[@id="show_more"]').click()
            time.sleep(get_time())
        detail_elements = drivertemp.find_elements_by_xpath('//*[@id="list_content"]/li/h2/b/a')
        # Yield all message links
        for element in detail_elements:
            detail_url = element.get_attribute('href')
            yield detail_url
        drivertemp.quit()
    except TimeoutException:
        drivertemp.quit()
        # Retry recursively; yield from is needed so the links from the retry are actually produced
        yield from get_detail_urls(position, list_url)
```

Using the fid from step 4, collect the links of all messages belonging to one leader. The message list is not displayed all at once; there is a "load more" button at the bottom, so a click has to be simulated repeatedly: scroll down, wait for the new items to load, and click again until the bottom is reached. The button may disappear once the end of the list is reached, or loading may fail because of anti-crawling measures or a poor network, in which case locating the element times out; exception handling with a recursive retry covers this. Instead of returning one complete list, the function uses the yield keyword to create a generator, so URLs are produced as the program proceeds, which reduces memory pressure.

6. Get message details

```python
def get_message_detail(driver, detail_url, writer, position):
    '''Get message details'''
    print('Crawling messages--', position, '--', detail_url)
    driver.get(detail_url)
    # If the message has no evaluation, skip it
    try:
        satis_degree = WebDriverWait(driver, 2.5).until(
            lambda driver: driver.find_element_by_class_name("sec-score_firstspan")).text.strip()
    except:
        return
    # Get each part of the message
    message_date_temp = WebDriverWait(driver, 2.5).until(
        lambda driver: driver.find_element_by_xpath("/html/body/div[6]/h3/span")).text
    message_date = re.search(r'\d{4}-\d{2}-\d{2}', message_date_temp).group()
    message_datetime = dparser.parse(message_date, fuzzy=True)
    if message_datetime < start_date:
        return
    message_title = WebDriverWait(driver, 2.5).until(
        lambda driver: driver.find_element_by_class_name("context-title-text")).text.strip()
    label_elements = WebDriverWait(driver, 2.5).until(
        lambda driver: driver.find_elements_by_class_name("domainType"))
    try:
        label1 = label_elements[0].text.strip()
        label2 = label_elements[1].text.strip()
    except:
        label1 = ''
        label2 = label_elements[0].text.strip()
    message_content = WebDriverWait(driver, 2.5).until(
        lambda driver: driver.find_element_by_xpath("/html/body/div[6]/p")).text.strip()
    replier = WebDriverWait(driver, 2.5).until(
        lambda driver: driver.find_element_by_xpath("/html/body/div[8]/ul/li[1]/h3[1]/i")).text.strip()
    reply_content = WebDriverWait(driver, 2.5).until(
        lambda driver: driver.find_element_by_xpath("/html/body/div[8]/ul/li[1]/p")).text.strip()
    reply_date_temp = WebDriverWait(driver, 2.5).until(
        lambda driver: driver.find_element_by_xpath("/html/body/div[8]/ul/li[1]/h3[2]/em")).text
    reply_date = re.search(r'\d{4}-\d{2}-\d{2}', reply_date_temp).group()
    review_scores = WebDriverWait(driver, 2.5).until(
        lambda driver: driver.find_elements_by_xpath("/html/body/div[8]/ul/li[2]/h4[1]/span/span/span"))
    resolve_degree = review_scores[0].text.strip()[:-1]
    handle_atti = review_scores[1].text.strip()[:-1]
    handle_speed = review_scores[2].text.strip()[:-1]
    review_content = WebDriverWait(driver, 2.5).until(
        lambda driver: driver.find_element_by_xpath("/html/body/div[8]/ul/li[2]/p")).text.strip()
    is_auto_review = 'Yes' if (('automatic default praise' in review_content)
                               or ('default evaluation' in review_content)) else 'No'
    review_date_temp = WebDriverWait(driver, 2.5).until(
        lambda driver: driver.find_element_by_xpath("/html/body/div[8]/ul/li[2]/h4[2]/em")).text
    review_date = re.search(r'\d{4}-\d{2}-\d{2}', review_date_temp).group()
    # Save the record to the CSV file
    writer.writerow(
        [position, message_title, label1, label2, message_date, message_content, replier, reply_content,
         reply_date, satis_degree, resolve_degree, handle_atti, handle_speed, is_auto_review, review_content,
         review_date])
```

Only messages that have been evaluated are needed, so messages without an evaluation are filtered out at the start. The remaining elements are then located by xpath, class name and so on to extract each part of the message. Every message is saved as a record of 16 fields in the csv file.

7. Get and save all messages from the leader

```python
def get_officer_messages(index, fid):
    '''Get and save all messages of one leader'''
    # Acquire the semaphore (see step 9) so that at most 3 leaders are crawled at the same time
    with sem:
        user_agent = get_user_agent()
        chrome_options.add_argument('user-agent=%s' % user_agent)
        driver = webdriver.Chrome(options=chrome_options)
        list_url = "http://liuyan.people.com.cn/threads/list?fid={}#state=4".format(fid)
        driver.get(list_url)
        try:
            position = WebDriverWait(driver, 10).until(
                lambda driver: driver.find_element_by_xpath("/html/body/div[4]/i")).text
            print(index, '--crawling--', position)
            start_time = time.time()
            csv_name = position + '.csv'
            # If the file already exists, delete it and recreate it
            if os.path.exists(csv_name):
                os.remove(csv_name)
            with open(csv_name, 'a+', newline='', encoding='gb18030') as f:
                writer = csv.writer(f, dialect="excel")
                writer.writerow(
                    ['Job name', 'Message title', 'Message label 1', 'Message label 2', 'Message date',
                     'Message content', 'Responder', 'Reply content', 'Reply date', 'Satisfaction degree',
                     'Solution degree points', 'Handling attitude points', 'Handling speed points',
                     'Whether automatically praised', 'Evaluation content', 'Evaluation date'])
                for detail_url in get_detail_urls(position, list_url):
                    get_message_detail(driver, detail_url, writer, position)
                    time.sleep(get_time())
            end_time = time.time()
            crawl_time = int(end_time - start_time)
            crawl_minute = crawl_time // 60
            crawl_second = crawl_time % 60
            print(position, 'Crawling is over!!!')
            print('This leader took: {} minutes {} seconds.'.format(crawl_minute, crawl_second))
            driver.quit()
            time.sleep(5)
        except:
            driver.quit()
            get_officer_messages(index, fid)
```

Obtain the leader's position, create a separate csv file for that leader to hold the extracted message information, and add a recursive call in the exception handler. The function calls get_message_detail() to fetch and save the details of each message, and also measures how long each leader takes to crawl.

8. Merge files

```python
def merge_csv():
    '''Merge all files'''
    file_list = os.listdir('.')
    csv_list = []
    for file in file_list:
        # Collect the per-leader csv files; skip the merged output itself
        if file.endswith('.csv') and file != 'DATA.csv':
            csv_list.append(file)
    # If the merged file already exists, delete it and recreate it
    if os.path.exists('DATA.csv'):
        os.remove('DATA.csv')
    with open('DATA.csv', 'a+', newline='', encoding='gb18030') as f:
        writer = csv.writer(f, dialect="excel")
        writer.writerow(
            ['Job name', 'Message title', 'Message label 1', 'Message label 2', 'Message date',
             'Message content', 'Responder', 'Reply content', 'Reply date', 'Satisfaction degree',
             'Solution degree points', 'Handling attitude points', 'Handling speed points',
             'Whether automatically praised', 'Evaluation content', 'Evaluation date'])
        for csv_file in csv_list:
            with open(csv_file, 'r', encoding='gb18030') as csv_f:
                reader = csv.reader(csv_f)
                line_count = 0
                for line in reader:
                    line_count += 1
                    # Skip each file's header row
                    if line_count != 1:
                        writer.writerow(
                            (line[0], line[1], line[2], line[3], line[4], line[5], line[6], line[7], line[8],
                             line[9], line[10], line[11], line[12], line[13], line[14], line[15]))
```

Merge the data of all the leaders that have been crawled.

9. Main function call

Multithreading is implemented mainly in this part, and there are 3 ways to realize it:

  • Use threading.Semaphore() to specify the number of threads, acquiring the semaphore as a context manager inside the function that is later passed as the thread target:
```python
def main():
    '''Main function'''
    fids = get_fid()
    print('The crawler program starts to execute:')
    s_time = time.time()
    thread_list = []
    # Create all threads first; the semaphore controls how many run at the same time
    for index, fid in enumerate(fids):
        t = threading.Thread(target=get_officer_messages, args=(index + 1, fid))
        thread_list.append([t, fid])
    for thread, fid in thread_list:
        # 5 levels of nested exception handling
        try:
            thread.start()
        except:
            try:
                thread.start()
            except:
                try:
                    thread.start()
                except:
                    try:
                        thread.start()
                    except:
                        try:
                            thread.start()
                        except:
                            # If it still fails, record the fid in the failure list for later re-crawling
                            print('The official failed to crawl and has been saved in the failure list for further crawling')
                            if not os.path.exists('fid_not_success.txt'):
                                with open('fid_not_success.txt', 'a+') as f:
                                    f.write(fid)
                            else:
                                with open('fid_not_success.txt', 'a+') as f:
                                    f.write('\n' + fid)
                            continue
    for thread, fid in thread_list:
        thread.join()
    print('The execution of the crawler program ends!!!')
    print('Start to merge files:')
    merge_csv()
    print('File merging is over!!!')
    e_time = time.time()
    c_time = int(e_time - s_time)
    c_minute = c_time // 60
    c_second = c_time % 60
    print('Total time spent on {} leaders: {} minutes {} seconds.'.format(len(fids), c_minute, c_second))


if __name__ == '__main__':
    '''Execute main function'''
    main()
```
  • Use concurrent.futures.ThreadPoolExecutor to specify the number of threads and call submit() to schedule each task:
```python
from concurrent.futures import ThreadPoolExecutor  # not in the imports above; needed for this variant


def main():
    '''Main function'''
    fids = get_fid()
    print('The crawler program starts to execute:')
    s_time = time.time()
    with ThreadPoolExecutor(3) as executor:
        for index, fid in enumerate(fids):
            # 5 levels of nested exception handling
            try:
                executor.submit(get_officer_messages, index + 1, fid)
            except:
                try:
                    executor.submit(get_officer_messages, index + 1, fid)
                except:
                    try:
                        executor.submit(get_officer_messages, index + 1, fid)
                    except:
                        try:
                            executor.submit(get_officer_messages, index + 1, fid)
                        except:
                            try:
                                executor.submit(get_officer_messages, index + 1, fid)
                            except:
                                # If it still fails, record the fid in the failure list for later re-crawling
                                print('The official failed to crawl and has been saved in the failure list for further crawling')
                                if not os.path.exists('fid_not_success.txt'):
                                    with open('fid_not_success.txt', 'a+') as f:
                                        f.write(fid)
                                else:
                                    with open('fid_not_success.txt', 'a+') as f:
                                        f.write('\n' + fid)
                                continue
    print('The execution of the crawler program ends!!!')
    print('Start to merge files:')
    merge_csv()
    print('File merging is over!!!')
    e_time = time.time()
    c_time = int(e_time - s_time)
    c_minute = c_time // 60
    c_second = c_time % 60
    print('Total time spent on {} leaders: {} minutes {} seconds.'.format(len(fids), c_minute, c_second))


if __name__ == '__main__':
    '''Execute main function'''
    main()
```
  • Use concurrent.futures.ThreadPoolExecutor to specify the number of threads and call map() to map the worker function over the argument sequences:
```python
from concurrent.futures import ThreadPoolExecutor  # not in the imports above; needed for this variant


def main():
    '''Main function'''
    fids = get_fid()
    print('The crawler program starts to execute:')
    s_time = time.time()
    with ThreadPoolExecutor(3) as executor:
        # Map the worker over the index sequence and the fid list
        executor.map(get_officer_messages, range(1, len(fids) + 1), fids)
    print('The execution of the crawler program ends!!!')
    print('Start to merge files:')
    merge_csv()
    print('File merging is over!!!')
    e_time = time.time()
    c_time = int(e_time - s_time)
    c_minute = c_time // 60
    c_second = c_time % 60
    print('Total time spent on {} leaders: {} minutes {} seconds.'.format(len(fids), c_minute, c_second))


if __name__ == '__main__':
    '''Execute main function'''
    main()
```

The main function first crawls all the leaders' messages with multiple threads, then merges all the data files, completing the whole crawling process; it also measures the total running time, which makes it easy to analyse efficiency.
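The fids recorded in fid_not_success.txt during a failed run can later be fed back into the same worker. A possible re-crawl helper might look like the sketch below; the function name recrawl_failed is hypothetical (it is not part of the original code), and for simplicity it re-crawls the failed leaders one by one rather than in threads:

```python
def recrawl_failed():
    '''Re-crawl the fids saved in fid_not_success.txt (hypothetical helper, not in the original code)'''
    if not os.path.exists('fid_not_success.txt'):
        print('No failed fids to re-crawl.')
        return
    with open('fid_not_success.txt', 'r') as f:
        failed_fids = f.read().split()
    # Re-run the existing worker for each failed leader, sequentially
    for index, fid in enumerate(failed_fids):
        get_officer_messages(index + 1, fid)
```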

3. Results, analysis and explanation

1. Result description

The 3 complete code versions and the test execution results can be downloaded from download.csdn.net/download/CU... You are welcome to test them and exchange ideas; please do not abuse them. The whole run is much shorter than with a single thread. I chose 10 leaders for testing, with different numbers of messages each, so that the advantage of multithreading shows. On the cloud server the running time drops to less than an hour and a half, roughly one third of the single-threaded time from the first article. Because 3 sub-threads execute at the same time, leaders with long and short message lists are crawled in parallel and complement each other, which clearly reduces the total running time. The merged DATA.csv is finally obtained. The advantages of multithreading can be summarised further:

  • Easy to schedule.
  • Improved concurrency: threads make it simple and effective to achieve concurrency; a process can create multiple threads that execute different parts of the same program.
  • Lower overhead: creating a thread is faster than creating a process and requires very little overhead.
  • Better use of multiprocessor hardware: in a multi-threaded process, each thread can run on a different processor, increasing the application's concurrency and keeping every processor busy.

2. Improvement analysis

(1) This version does not yet fetch all fids automatically; they still have to be collected and saved manually. This shortcoming can be addressed later. (2) Selenium simulation is still used to crawl the message detail pages, which slows down requests; using the requests library instead could be considered. (3) The anti-crawling countermeasures in this version are weak, so exceptions occur in many situations, for example when an abnormal page is returned and the expected element cannot be found, or when requests take too long. Further countermeasures can be added in later versions to make the code more robust.
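As a rough illustration of point (2), a requests-based fetch of a detail page might look like the following sketch. It assumes the detail page content is rendered server-side rather than by JavaScript, which would have to be verified before replacing Selenium, and the parsing step is left out; the helper name fetch_detail_page is illustrative only:

```python
import requests

def fetch_detail_page(detail_url):
    '''Fetch a message detail page with requests instead of Selenium (sketch, not the original implementation)'''
    headers = {'User-Agent': get_user_agent()}          # reuse the random user agent from step 3
    resp = requests.get(detail_url, headers=headers, timeout=10)
    resp.raise_for_status()                             # raise if the server returns an error status
    resp.encoding = resp.apparent_encoding              # let requests guess the page encoding
    return resp.text                                    # parse afterwards, e.g. with lxml or BeautifulSoup
```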

3. Legality statement

  • This project is for learning and scientific research only. Readers may refer to the ideas and program code, but they must not be used for malicious or illegal purposes (attacking the website's servers, illegal profit, etc.); anyone who does so bears the responsibility.
  • The data obtained in this project is meant, after further analysis, to help improve e-government and to serve as a reference for government decision-making. It is not collected to gain an unfair competitive advantage, nor for commercial purposes or illegal profit. The code was only tested with a few fids rather than crawled at large scale, and the crawling rate was strictly controlled so as not to put pressure on the server. If the interests of the crawled party (i.e. the website) are nevertheless infringed, please get in touch so the content can be changed or deleted.
  • This project is the second in the message board crawling series and will continue to be updated. Readers are welcome to exchange ideas so it can keep improving.