Developing a Chrome plug-in to implement a crawler

Hello everyone! I'm Caiji H, a front-end enthusiast.

Requirements

  • Once the plug-in is installed, it intercepts page requests across the whole browser. Which interfaces are intercepted and displayed can be specified through the plug-in's configuration, and the intercepted data can be downloaded and exported, so the plug-in can work with a back end to do many things.

Problems

  • Intercept all requests and assemble the request information and results
  • Let the plug-in and the page communicate with each other and act on the messages

First, let's get to know a file called manifest.json. It is the plug-in's configuration file: Chrome reads it first to initialize the plug-in, for example the plug-in's icons, the page shown when the plug-in is clicked, the permissions the plug-in requires, and so on.

Your first plug-in

  • Create the configuration file manifest.json
  • Create a display icon
  • Create a display page

    // chromePlugin file structure
    - icon
      - logo.png        // it seems the image size cannot exceed 129; if the icon is not displayed, make the image smaller
    - background
      - index.html
      - index.js
    - browser_action
      - index.html
      - index.js
      - index.css
    - manifest.json

manifest.json

    {
      // Your plug-in's name
      "name": "chrome",
      // Description
      "description": "chrome plug-in",
      // Version
      "version": "1.0",
      // Required; set it to 2
      "manifest_version": 2,
      // Think of this as a background service that your plug-in injects into the browser
      "background": {
        "page": "/background/index.html"
      },
      // The page displayed when the plug-in is clicked
      "browser_action": {
        "default_icon": "/icon/logo.png",
        "default_title": "chrome plug-in",
        "default_popup": "/browser_action/index.html"
      },
      // Icons
      "icons": {
        "16": "/icon/logo.png",
        "32": "/icon/logo.png",
        "48": "/icon/logo.png",
        "128": "/icon/logo.png"
      }
    }

browser_action/index.html and index.css

index.html:

    <!DOCTYPE html>
    <html lang="en">
    <head>
      <meta charset="UTF-8">
      <meta http-equiv="X-UA-Compatible" content="IE=edge">
      <meta name="viewport" content="width=device-width, initial-scale=1.0">
      <title>Document</title>
      <!-- Pay attention to the path rule for the CSS file referenced here -->
      <link rel="stylesheet" type="text/css" href="/browser_action/index.css">
    </head>
    <body>
      <div class="app">My first chrome plugin</div>
    </body>
    </html>

index.css:

    .app {
      width: 200px;
      height: 100px;
      background: yellow;
    }

A brief explanation of the configuration file: you can think of background as a piece of your plug-in's code that stays resident in the browser, where you can keep data shared with the plug-in's pages, and so on. browser_action is the page shown after the plug-in is clicked; try writing whatever content you want to show there. Then open the browser, go to More tools -> Extensions, turn on developer mode in the upper right corner, and drag your project folder straight in. It will be recognized automatically and, if nothing goes wrong, your plug-in is installed. Click the plug-in, and congratulations: your first Chrome plug-in is complete!

Problem 1: Intercept all requests and assemble the request information and results

The idea is to rewrite XMLHttpRequest and fetch, and then inject the rewritten code into every page through the configuration file that Chrome provides, which gives us the interception. First, get to know the content_scripts setting: it tells Chrome that the plug-in needs to load some of its own JS into the current web page. Add the following to manifest.json:

    // Rules for the JS injected into pages
    "content_scripts": [
      {
        // Which pages the content script is injected into; "<all_urls>" means all pages
        "matches": ["<all_urls>"],
        // CSS file paths
        "css": [],
        // Paths of the JS files to inject
        "js": ["/contentScript/install.js"],
        // When the content script is injected: document_start, document_end or document_idle (default: document_idle)
        "run_at": "document_start"
      }
    ],
    // Resources in the package are addressed via chrome.extension.getURL; they must be listed here in web_accessible_resources
    "web_accessible_resources": ["/contentScript/network.js"]

OK, we have now told Chrome which JS to inject. Next, create the contentScript folder at the corresponding path, and inside it create two files: install.js and network.js.

install.js

    setTimeout(() => {
      const script = document.createElement('script');
      script.setAttribute('type', 'text/javascript');
      // Resources in the package are addressed via chrome.extension.getURL;
      // the file must be listed under web_accessible_resources in manifest.json
      script.setAttribute('src', chrome.extension.getURL('/contentScript/network.js'));
      document.head.appendChild(script);
    });

network.js: rewriting the request methods to intercept them

    const tool = {
      isString(value) {
        return Object.prototype.toString.call(value) == '[object String]';
      },
      isWindow(value) {
        const type = Object.prototype.toString.call(value);
        return type == '[object global]' || type == '[object Window]' || type == '[object DOMWindow]';
      },
      isPlainObject(obj) {
        const hasOwn = Object.prototype.hasOwnProperty;
        // Must be an Object.
        if (!obj || typeof obj !== 'object' || obj.nodeType || tool.isWindow(obj)) {
          return false;
        }
        try {
          if (obj.constructor && !hasOwn.call(obj, 'constructor') &&
              !hasOwn.call(obj.constructor.prototype, 'isPrototypeOf')) {
            return false;
          }
        } catch (e) {
          return false;
        }
        let key;
        for (key in obj) {}
        return key === undefined || hasOwn.call(obj, key);
      }
    };

    // This class is based on Tencent's open source vConsole
    // (https://github.com/Tencent/vConsole) and adapted for this plug-in
    class RewriteNetwork {
      constructor() {
        this.reqList = {}; // request id as key, request item as value
        this._open = undefined; // the original functions
        this._send = undefined;
        this._setRequestHeader = undefined;
        this.status = false;
        this.mockAjax();
        this.mockFetch();
      }

      // Restore the native XMLHttpRequest methods
      onRemove() {
        if (window.XMLHttpRequest) {
          window.XMLHttpRequest.prototype.open = this._open;
          window.XMLHttpRequest.prototype.send = this._send;
          window.XMLHttpRequest.prototype.setRequestHeader = this._setRequestHeader;
          this._open = undefined;
          this._send = undefined;
          this._setRequestHeader = undefined;
        }
      }

      /**
       * mock ajax request
       * @private
       */
      mockAjax() {
        const _XMLHttpRequest = window.XMLHttpRequest;
        if (!_XMLHttpRequest) { return; }
        const that = this;

        // Save the native XMLHttpRequest methods before rewriting them below
        const _open = window.XMLHttpRequest.prototype.open,
              _send = window.XMLHttpRequest.prototype.send,
              _setRequestHeader = window.XMLHttpRequest.prototype.setRequestHeader;
        that._open = _open;
        that._send = _send;
        that._setRequestHeader = _setRequestHeader;

        // Rewrite open
        window.XMLHttpRequest.prototype.open = function () {
          const XMLReq = this;
          const args = [].slice.call(arguments),
                method = args[0],
                url = args[1],
                id = that.getUniqueID();
          let timer = null;

          // may be used by other functions
          XMLReq._requestID = id;
          XMLReq._method = method;
          XMLReq._url = url;
          // Register the request item so that setRequestHeader and send can find it
          that.reqList[id] = {};

          // mock onreadystatechange
          const _onreadystatechange = XMLReq.onreadystatechange || function () {};
          // Handler called every time readyState changes (also driven by the
          // polling timer below). While readyState is 3 it may be called several times.
          const onreadystatechange = function () {
            const item = that.reqList[id] || {};

            // Sync XMLReq status
            item.readyState = XMLReq.readyState;
            item.status = 0;
            if (XMLReq.readyState > 1) {
              item.status = XMLReq.status;
            }
            item.responseType = XMLReq.responseType;

            if (XMLReq.readyState == 0) {
              // Initial state: the object has been created or reset by abort()
              if (!item.startTime) { item.startTime = (+new Date()); }
            } else if (XMLReq.readyState == 1) {
              // open() has been called but send() has not; the request is not sent yet
              if (!item.startTime) { item.startTime = (+new Date()); }
            } else if (XMLReq.readyState == 2) {
              // HEADERS_RECEIVED: send() has been called and the request has been
              // sent to the server; no response body has been received yet
              item.header = {};
              const header = XMLReq.getAllResponseHeaders() || '',
                    headerArr = header.split("\n");
              // extract plain text to key-value format
              for (let i = 0; i < headerArr.length; i++) {
                const line = headerArr[i];
                if (!line) { continue; }
                const arr = line.split(': ');
                const key = arr[0],
                      value = arr.slice(1).join(': ');
                item.header[key] = value;
              }
            } else if (XMLReq.readyState == 3) {
              // All response headers have been received; the body is still arriving
            } else if (XMLReq.readyState == 4) {
              // The HTTP response has been completely received
              clearInterval(timer);
              item.endTime = +new Date();
              item.costTime = item.endTime - (item.startTime || item.endTime);
              item.response = XMLReq.response;
              item.method = XMLReq._method;
              item.url = XMLReq._url;
              item.req_type = 'xml';
              item.getData = XMLReq.getData;
              item.postData = XMLReq.postData;
              that.filterData(item);
            } else {
              clearInterval(timer);
            }
            return _onreadystatechange.apply(XMLReq, arguments);
          };
          XMLReq.onreadystatechange = onreadystatechange;

          // Poll the state, in case the page replaces onreadystatechange later
          let preState = -1;
          timer = setInterval(function () {
            if (preState != XMLReq.readyState) {
              preState = XMLReq.readyState;
              onreadystatechange.call(XMLReq);
            }
          }, 10);

          return _open.apply(XMLReq, args);
        };

        // Rewrite setRequestHeader
        window.XMLHttpRequest.prototype.setRequestHeader = function () {
          const XMLReq = this;
          const args = [].slice.call(arguments);
          const item = that.reqList[XMLReq._requestID];
          if (item) {
            if (!item.requestHeader) { item.requestHeader = {}; }
            item.requestHeader[args[0]] = args[1];
          }
          return _setRequestHeader.apply(XMLReq, args);
        };

        // Rewrite send
        window.XMLHttpRequest.prototype.send = function () {
          const XMLReq = this;
          const args = [].slice.call(arguments),
                data = args[0];
          const item = that.reqList[XMLReq._requestID] || {};
          item.method = XMLReq._method ? XMLReq._method.toUpperCase() : 'GET';

          let query = XMLReq._url ? XMLReq._url.split('?') : []; // a.php?b=c&d=?e => ['a.php', 'b=c&d=', 'e']
          item.url = XMLReq._url || '';
          item.name = query.shift() || ''; // => ['b=c&d=', 'e']
          item.name = item.name.replace(new RegExp('[/]*$'), '').split('/').pop() || '';

          if (query.length > 0) {
            item.name += '?' + query;
            item.getData = {};
            query = query.join('?'); // => 'b=c&d=?e'
            query = query.split('&'); // => ['b=c', 'd=?e']
            for (let q of query) {
              q = q.split('=');
              try {
                item.getData[q[0]] = decodeURIComponent(q[1]);
              } catch (e) {
                // Malformed escape sequence: keep the raw value instead of breaking the request
                item.getData[q[0]] = q[1];
              }
            }
          }

          if (item.method == 'POST') {
            // save POST data
            if (tool.isString(data)) {
              const arr = data.split('&');
              item.postData = {};
              for (let q of arr) {
                q = q.split('=');
                item.postData[q[0]] = q[1];
              }
            } else if (tool.isPlainObject(data)) {
              item.postData = data;
            } else {
              item.postData = '[object Object]';
            }
          }

          XMLReq.getData = item.getData || "";
          XMLReq.postData = item.postData || "";
          return _send.apply(XMLReq, args);
        };
      }

      /**
       * mock fetch request
       * @private
       */
      mockFetch() {
        const _fetch = window.fetch;
        if (!_fetch) { return; }
        const that = this;

        const prevFetch = function (input, init) {
          const id = that.getUniqueID();
          that.reqList[id] = {};
          const item = that.reqList[id];
          const opts = init || {}; // init is optional
          let query = [],
              url = '',
              method = 'GET',
              requestHeader = null;

          // handle `input` content
          if (tool.isString(input)) { // when `input` is a string
            method = opts.method ? opts.method : 'GET';
            url = input;
            requestHeader = opts.headers ? opts.headers : null;
          } else { // when `input` is a `Request` (or URL) object
            method = input.method || 'GET';
            url = input.url || String(input);
            requestHeader = input.headers;
          }

          query = url.split('?');
          item.id = id;
          item.method = method;
          item.requestHeader = requestHeader;
          item.url = url;
          item.name = query.shift() || '';
          item.name = item.name.replace(new RegExp('[/]*$'), '').split('/').pop() || '';

          if (query.length > 0) {
            item.name += '?' + query;
            item.getData = {};
            query = query.join('?'); // => 'b=c&d=?e'
            query = query.split('&'); // => ['b=c', 'd=?e']
            for (let q of query) {
              q = q.split('=');
              item.getData[q[0]] = q[1];
            }
          }

          if (String(item.method).toUpperCase() === 'POST') {
            if (tool.isString(input)) {
              if (tool.isString(opts.body)) {
                const arr = opts.body.split('&');
                item.postData = {};
                for (let q of arr) {
                  q = q.split('=');
                  item.postData[q[0]] = q[1];
                }
              } else if (tool.isPlainObject(opts.body)) {
                item.postData = opts.body;
              } else {
                item.postData = '[object Object]';
              }
            } else {
              item.postData = '[object Object]';
            }
          }

          // UNSENT
          if (!item.startTime) { item.startTime = (+new Date()); }

          // Pass the original arguments through so a Request object keeps its method and body
          return _fetch(input, init).then((response) => {
            // Read a clone so the page can still consume the original body
            response.clone().json().then((json) => {
              item.endTime = +new Date();
              item.costTime = item.endTime - (item.startTime || item.endTime);
              item.status = response.status;
              item.header = {};
              for (const pair of response.headers.entries()) {
                item.header[pair[0]] = pair[1];
              }
              item.response = json;
              item.readyState = 4;
              const contentType = response.headers.get('content-type') || '';
              item.responseType = contentType.includes('application/json') ? 'json'
                : contentType.includes('text/html') ? 'text' : '';
              item.req_type = 'fetch';
              that.filterData(item);
              return json;
            }).catch(() => {
              // The body is not JSON; this class only records JSON responses
            });
            return response;
          });
        };
        window.fetch = prevFetch;
      }

      filterData({ url, method, req_type, response, getData, postData }) {
        if (!url) return;
        const req_data = {
          url,
          method,
          req_type,
          response,
          getData, // query parameters
          postData
        };
        console.log('result of interception', req_data);
      }

      /**
       * generate an unique id string (32)
       * @private
       * @return string
       */
      getUniqueID() {
        const id = 'xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx'.replace(/[xy]/g, function (c) {
          const r = Math.random() * 16 | 0,
                v = c == 'x' ? r : (r & 0x3 | 0x8);
          return v.toString(16);
        });
        return id;
      }
    }

    const network = new RewriteNetwork();
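The hand-written splitting on '?', '&' and '=' in send and prevFetch can also be expressed with the standard URL and URLSearchParams APIs. The sketch below is an alternative, not the code used above: parseQuery and its base parameter are names assumed here for illustration, and unlike the manual version it percent-decodes values and treats '+' as a space.

```javascript
// Hypothetical helper (not part of network.js): parse the query string of a
// request URL into a plain object using the standard URL API.
function parseQuery(url, base = 'http://localhost/') {
  // `base` only lets relative URLs such as 'a.php?b=c' be parsed; it is a
  // placeholder origin assumed for this sketch.
  const parsed = new URL(url, base);
  const result = {};
  for (const [key, value] of parsed.searchParams) {
    result[key] = value; // a repeated key keeps its last value
  }
  return result;
}

console.log(parseQuery('a.php?b=c&d=%3Fe'));
```

Inside a page, location.href would be a more natural base than the placeholder used here.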

Open any test website and press F12 to open the console; the intercepted results will be printed there. That completes the interception of page requests. Next, we need to implement the communication between the plug-in and the page and carry out the corresponding operations.
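Stripped of its bookkeeping, the interception in network.js rests on one pattern: save the original method, replace it with a wrapper that records the call and then calls the original, and put the original back on removal (what onRemove does). Here is a minimal sketch of that pattern. patchMethod and FakeRequest are names made up for this example; FakeRequest only stands in for XMLHttpRequest so that the code runs outside a browser.

```javascript
// Hypothetical helper: replace proto[name] with a wrapper that calls `before`
// and then the original method. Returns a function that restores the original.
function patchMethod(proto, name, before) {
  const original = proto[name];
  proto[name] = function (...args) {
    before(this, args);                // observe the call
    return original.apply(this, args); // then behave exactly as before
  };
  return function restore() {
    proto[name] = original;
  };
}

// Stand-in for XMLHttpRequest so the sketch runs outside a browser.
class FakeRequest {
  open(method, url) {
    this.opened = `${method} ${url}`;
    return this.opened;
  }
}

const seen = [];
const restore = patchMethod(FakeRequest.prototype, 'open', (req, [method, url]) => {
  seen.push({ method, url });
});

new FakeRequest().open('GET', '/api/list');  // recorded by the wrapper
restore();
new FakeRequest().open('POST', '/api/save'); // not recorded: the original is back

console.log(seen);
```

Calling the original with apply and the untouched arguments is what keeps the page's own requests working while they are being observed.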

Problem 2: Communication between the plug-in and the page

  • inject_js (the JS actually inserted into the page) communicates with content_script

    // inject_js uses postMessage to communicate with content_script, sending the
    // data from inside the request interception method
    // network.js
    const sendMes = (data) => {
      window.postMessage(data, '*');
    };
    ....
    console.log('result of interception', req_data);
    sendMes(req_data);

    // install.js
    // Receive the messages from the injected page script
    ....
    window.addEventListener("message", function (e) {
      const { data } = e;
      console.log('Receive networkJS data', data);
    }, false);

  • content_script communicates with background (the resident background service)

    // content_script/install.js
    const sendBgMessage = (data) => {
      chrome.runtime.sendMessage({ type: 'page_request', data }, function (response) {
        console.log('Background reply:' + response);
      });
    };

    // background (the resident background service) receives the message
    chrome.runtime.onMessage.addListener(function (request, sender, sendResponse) {
      console.log('background receiving data', request);
      // reply
      sendResponse('bg background received message');
    });

  • The browser_action page communicates with background.js in basically the same way

    // browser_action page js
    const sendMes = (data) => {
      return new Promise(resolve => {
        chrome.runtime.sendMessage(data, (res) => { resolve(res); });
      });
    };

    // background (the resident background service) receives the message
    chrome.runtime.onMessage.addListener(function (request, sender, sendResponse) {
      console.log('background receiving data', request);
      // reply
      sendResponse('bg background received message');
    });
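Both listeners above are identical, so in practice the background side has to tell the content script's messages apart from the popup's. One way is to dispatch on the type field, as in the sketch below. Only 'page_request' comes from the code above; createBackgroundHandler, the 'get_requests' type and its tabId field are assumptions made for this example.

```javascript
// Sketch of a background-side message handler. 'page_request' comes from the
// article; 'get_requests' is a hypothetical message type for the popup.
function createBackgroundHandler() {
  const requestsByTab = new Map(); // tab id -> array of intercepted requests

  return function onMessage(request, sender, sendResponse) {
    if (request.type === 'page_request') {
      // Messages from a content script carry sender.tab; fall back to -1.
      const tabId = sender && sender.tab ? sender.tab.id : -1;
      if (!requestsByTab.has(tabId)) requestsByTab.set(tabId, []);
      requestsByTab.get(tabId).push(request.data);
      sendResponse({ ok: true });
    } else if (request.type === 'get_requests') {
      sendResponse(requestsByTab.get(request.tabId) || []);
    } else {
      sendResponse({ ok: false, error: 'unknown message type' });
    }
  };
}

// In the extension this would be registered with
// chrome.runtime.onMessage.addListener(createBackgroundHandler());
// here it is called directly so the sketch runs on its own.
const handler = createBackgroundHandler();
const replies = [];
handler({ type: 'page_request', data: { url: '/api/list' } }, { tab: { id: 7 } }, (r) => replies.push(r));
handler({ type: 'get_requests', tabId: 7 }, {}, (r) => replies.push(r));
console.log(replies);
```

Keeping the records in the background page means the popup can ask for them whenever it is opened, since the popup itself is destroyed each time it closes.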

Conclusion

That covers the core of it. The next step is to configure and finish the plug-in according to your own business scenario. The development documents are listed below.
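As one example of such configuration, the requirement to intercept only specified interfaces and export their data could be handled by filtering the records that filterData produces. This is a rough sketch under my own assumptions: the rule format (substrings or regular expressions) and the names interceptRules, matchesConfig and exportAsJson are not part of the plug-in above.

```javascript
// Hypothetical configuration: substrings or regular expressions naming the
// interfaces whose requests should be kept.
const interceptRules = ['/api/user', /\/order\/\d+$/];

// True when the URL matches at least one configured rule.
function matchesConfig(url, rules) {
  return rules.some((rule) =>
    rule instanceof RegExp ? rule.test(url) : url.includes(rule)
  );
}

// Serialize the kept records for download or export.
function exportAsJson(records, rules) {
  const kept = records.filter((record) => matchesConfig(record.url, rules));
  return JSON.stringify(kept, null, 2);
}

const records = [
  { url: 'https://example.com/api/user?id=1', method: 'GET' },
  { url: 'https://example.com/static/app.js', method: 'GET' },
  { url: 'https://example.com/order/42', method: 'POST' },
];
console.log(exportAsJson(records, interceptRules));
```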

Chrome documentation (Chinese)

Chrome documentation (English)