Don't be afraid of "search": principles, architecture, implementation, practice, and interviews (worth bookmarking)!!!

Perhaps 99% of engineers will never build a search engine, but 99% of them have implemented a search feature. What technology actually lies behind "search"? I hope this article gives you some insight.

What are the architecture and workflow of a whole-network search engine?

The macro architecture of a whole-network search engine is shown in the figure above. The core subsystems fall into three parts (the pink parts):

(1) The spider crawling system;

(2) The search&index system, which builds and queries the index; it consists of two parts:

  • build_index, the part that generates the index data

  • search_index, the part that queries the index data

(3) The rank scoring and sorting system;

The core data falls into two parts (the purple parts):

(1) The web page library;

(2) The index data;

The business characteristics of a whole-network search engine determine that it is a system where "writing" and "retrieval" are separated.

How is writing implemented?

System composition: completed by the spider and search&index systems.

Input: web pages produced by site owners across the Internet.

Output: forward and inverted index data.

Process: steps 1, 2, 3, 4 in the architecture diagram:

(1) The spider crawls web pages from the Internet;

(2) The spider stores the crawled pages in the web page library (this demands enormous storage: it holds a near-complete mirror of the "World Wide Web");

(3) build_index reads data from the web page library and performs word segmentation;

(4) build_index generates an inverted index;

How is the search carried out?

System composition: completed by the search&index and rank systems.

Input: the user's search term.

Output: the sorted first page of search results.

Process: steps a, b, c, d in the architecture diagram:

(a) search_index receives the user's search term and performs word segmentation;

(b) search_index queries the inverted index to obtain the pages whose text matches the terms; these are the preliminary screening results;

(c) rank scores and sorts the preliminary results;

(d) rank returns the sorted first page of results;

What are the architecture and workflow of an in-site search engine?

After all, only a handful of companies build whole-network search. What most companies actually need is in-site search. Take searching the 10 billion posts on 58.com as an example; the overall architecture is as follows:

The macro architecture of the in-site search engine is shown in the figure above. Compared with the whole-network search engine, the only difference lies in the write path:

(1) Whole-network search needs spiders to go out and pull the data;

(2) In-site search data is produced by internal systems; for example, the "publishing system" actively pushes newly generated posts to the build_data system;

Voiceover: this seemingly "small" difference makes a huge difference in architectural difficulty: crawling the "complete" set of web pages in "real time" across the whole network is extremely hard, whereas an in-site search can easily obtain all of its data in real time.

For the three systems spider, search&index, and rank:

(1) spider and search&index are mainly engineering systems;

(2) rank is tightly coupled to business, strategy, and algorithms. Differences in search experience mostly come from here, and optimizing business and strategy takes time to accumulate. The lessons:

  • Google's experience is better than Baidu's because the former ranks better

  • It is very hard for a domestic Internet company (such as 360) to build, in a short time, a search experience that surpasses Baidu's; it genuinely takes time to accumulate

The content so far has been very macro. For the majority of readers who have never built a search engine, the data structures and algorithms part starts from the basics: the forward index and the inverted index.

What is a forward index?

In short, a forward index is what you use when querying an entity by its key.

For example, the user table:

t_user(uid, name, passwd, age, sex)

The process of querying an entire row by uid is a forward index query.

Another example, web page library:

t_web_page(url, page_content)

The process of querying the entire webpage by url is also a forward index query.

After word segmentation of the page content, page_content corresponds to a list of segmented items.

Simply put, the forward index can be understood as:

Map<url, list<item>>

a data structure that quickly finds a page's content by its url.

Voiceover: The time complexity can be considered O(1).

What is an inverted index?

Contrary to the forward index, an inverted index is what you use when querying keys by item.

For web search, the inverted index can be understood as:

Map<item, list<url>>

a data structure that quickly finds the pages containing a given query term.

Voiceover: The time complexity is also O(1).

For example, suppose there are 3 web pages:

url1 -> "I love Beijing"

url2 -> "I love home"

url3 -> "Home is beautiful"

This is the forward index:

Map<url, page_content>.

After word segmentation:

url1 -> {I, love, Beijing}

url2 -> {I, love, home}

url3 -> {home, beautiful}

This is a forward index after word segmentation:

Map<url, list<item>>.

The inverted index after word segmentation:

I -> {url1, url2}

love -> {url1, url2}

Beijing -> {url1}

home -> {url2, url3}

beautiful -> {url3}

The structure that, given a search term item, quickly finds the list<url> of pages containing it, Map<item, list<url>>, is the inverted index.

Voiceover: in short, going from a word to the URLs that contain it is the job of the inverted index.
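To make the two structures concrete, here is a minimal sketch in plain Python that builds both indexes for the three example pages above (an illustration only, not the real build_index system):

from collections import defaultdict

# page content after word segmentation (the segmented forward index)
pages = {
    "url1": ["I", "love", "Beijing"],
    "url2": ["I", "love", "home"],
    "url3": ["home", "beautiful"],
}

# forward index: Map<url, list<item>> -- find a page's terms by its url
forward_index = pages

# inverted index: Map<item, list<url>> -- find the pages containing a term
inverted_index = defaultdict(list)
for url, items in pages.items():
    for item in items:
        inverted_index[item].append(url)

print(inverted_index["love"])   # ['url1', 'url2']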

The forward index and the inverted index are data structures built in advance by the spider and build_index systems. They are used precisely because they make the "user retrieves web pages" requirement fast.

Voiceover: business requirements determine the architecture; these structures exist to make queries fast.

What is the retrieval process?

Assuming the search term is "I love":

(1) Word segmentation: "I love" is segmented into {I, love}; time complexity O(1);

(2) For each item produced by segmentation, query the inverted index for the list of pages containing it, also O(1) per lookup:

I -> {url1, url2}

love -> {url1, url2}

(3) Intersect the lists; the pages in the intersection satisfy every query term. In this example, {url1, url2} is the final query result;

Voiceover: The retrieval process is also very simple: word segmentation, inverted index search, and result set intersection.
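A minimal sketch of this flow, reusing the toy inverted index from above (the whitespace split is only a stand-in for a real word segmenter):

inverted_index = {
    "I": {"url1", "url2"},
    "love": {"url1", "url2"},
    "Beijing": {"url1"},
    "home": {"url2", "url3"},
    "beautiful": {"url3"},
}

def search(query):
    items = query.split()   # stand-in for real word segmentation
    url_sets = [inverted_index.get(item, set()) for item in items]
    return set.intersection(*url_sets) if url_sets else set()

print(search("I love"))   # {'url1', 'url2'}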

Is that all? Not quite. Word segmentation and inverted lookups are both O(1), so the time complexity of the whole search hinges on "intersecting the lists"; the problem reduces to finding the intersection of two sets.

String URLs are unwieldy to store and compute with, so in practice each URL is identified by a numeric url_id. Below, for convenience of description, list<url> is written as list<url_id>.

How to find the intersection of list1 and list2?

Solution 1: nested for loops, the naive method, time complexity O(n*n)

Each search term hits many web pages, so O(n*n) complexity is clearly unacceptable. If the inverted index is sorted when it is first built, the problem becomes the intersection of two sorted lists, which is far more tractable.

Voiceover: a dumb way.

Solution 2: intersect sorted lists with the zipper method

Ordered set 1{1,3,5,7,8,9}

Ordered set 2{2,3,4,5,6,7}

Two pointers start at the first element of each list, and the pointed-to elements are compared:

(1) If they are equal, put the value into the result set and advance either pointer;

(2) Otherwise, advance the pointer at the smaller value, repeating until either list is exhausted;

The advantages of this method are:

(1) Each element in the sets is compared at most once, so the time complexity is O(n);

(2) It can run over multiple sorted sets at once, which suits intersecting the url_id lists of several segmented terms;

The two pointer scans mesh one tooth at a time like the two sides of a zipper, hence the name "zipper method";

Voiceover: because the inverted index is built in advance, its "sorted" property can be exploited.
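A minimal sketch of the zipper method on the two sorted lists above (illustrative Python, not production code):

def zipper_intersect(a, b):
    i, j, result = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])   # equal: keep it, advance either pointer
            i += 1
        elif a[i] < b[j]:
            i += 1                # advance the pointer at the smaller value
        else:
            j += 1
    return result

print(zipper_intersect([1, 3, 5, 7, 8, 9], [2, 3, 4, 5, 6, 7]))   # [3, 5, 7]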

Solution 3: parallelization with buckets

When the data volume is large, horizontal splitting by url_id into buckets plus parallel computation is a common optimization. If list1<url_id> and list2<url_id> are each split into several bucket ranges, every range can be intersected in parallel by its own thread, and the union of the per-thread result sets is the final result. This can cut execution time dramatically.

For example:

Ordered set 1{1,3,5,7,8,9, 10,30,50,70,80,90}

Ordered set 2{2,3,4,5,6,7, 20,30,40,50,60,70}

Find the intersection, first split into buckets:

The range of bucket 1 is [1, 9]

The range of bucket 2 is [10, 100]

The range of bucket 3 is [101, max_int]

then:

Set 1 is split into

Set a{1,3,5,7,8,9}

Set b{10,30,50,70,80,90}

Set c{}

Set 2 is split into

Set d{2,3,4,5,6,7}

Set e{20,30,40,50,60,70}

Set f{}

The amount of data in each bucket is much smaller, and no element appears in more than one bucket, so the buckets can be processed in parallel with multiple threads:

The intersection of set a and set d in bucket 1 is x{3,5,7}

The intersection of set b and set e in bucket 2 is y{30, 50, 70}

The intersection of set c and set f in bucket 3 is z{}

Finally, the intersection of set 1 and set 2 is the union of x, y, and z, that is, the set {3,5,7,30,50,70}.

Voiceover: Multi-threading and horizontal segmentation are common optimization methods.
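A minimal sketch of the idea, with the illustrative bucket bounds above (note: in CPython, true CPU parallelism for pure-Python work would need a process pool rather than threads; the structure is what matters here):

from concurrent.futures import ThreadPoolExecutor

def intersect(a, b):
    return sorted(set(a) & set(b))   # per-bucket intersection

def split_into_buckets(data, bounds):
    return [[x for x in data if lo <= x <= hi] for lo, hi in bounds]

set1 = [1, 3, 5, 7, 8, 9, 10, 30, 50, 70, 80, 90]
set2 = [2, 3, 4, 5, 6, 7, 20, 30, 40, 50, 60, 70]
bounds = [(1, 9), (10, 100), (101, float("inf"))]

buckets1 = split_into_buckets(set1, bounds)
buckets2 = split_into_buckets(set2, bounds)
with ThreadPoolExecutor() as pool:
    pieces = pool.map(intersect, buckets1, buckets2)   # one task per bucket

print(sorted(x for piece in pieces for x in piece))   # [3, 5, 7, 30, 50, 70]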

Solution 4: further optimization with bitmaps

After the data is split into horizontal buckets, the data in each bucket is confined to a fixed range. A set with this property can be represented as a bitmap:

As shown in the figure above, suppose all elements of set1{1,3,5,7,8,9} and set2{2,3,4,5,6,7} fall within the bucket range [1, 16]. Each set can then be described with 16 bits: for an element x in the original set, the x-th bit of the bitmap is set to 1. Intersecting the two sets now takes nothing more than ANDing the two bitmaps; bits 3, 5, and 7 of the resulting bitmap are 1, so the intersection of the original sets is {3,5,7}.

After horizontal bucketing and bitmap optimization, intersection becomes far more efficient, but the time complexity is still O(n). Bitmaps also require large contiguous space and consume a lot of memory.

Voiceover: a bitmap can represent a set, and intersecting sets represented as bitmaps is extremely fast.
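A minimal sketch, using a Python integer as the bitmap (bit x set to 1 means element x is present):

def to_bitmap(elements):
    bm = 0
    for x in elements:
        bm |= 1 << x          # element x -> the x-th bit becomes 1
    return bm

def from_bitmap(bm):
    return [x for x in range(bm.bit_length()) if bm & (1 << x)]

set1 = to_bitmap([1, 3, 5, 7, 8, 9])
set2 = to_bitmap([2, 3, 4, 5, 6, 7])
print(from_bitmap(set1 & set2))   # a single AND intersects the bucket: [3, 5, 7]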

Solution 5: skip list (skiplist)

The skip list is the most commonly used data structure here; it can reduce the complexity of intersecting sorted sets from O(n) to close to O(log(n)).

Set 1{1,2,3,4,20,21,22,23,50,60,70}

Set 2 {50,70}

To intersect them with the zipper method, 1, 2, 3, 4, 20, 21, 22, and 23 are all traversed to no effect, with every element compared once, for O(n) overall. Could some elements be "skipped over" at each step?

Enter the skip list:

When set1{1,2,3,4,20,21,22,23,50,60,70} is built as a skip list, the top level holds only three elements {1,20,50}, while the bottom level is the same as an ordinary linked list.

Set2{50,70} has so few elements that it stays a single-level ordinary linked list.

During the "zipper" scan, the pointer into set1 can now jump from 1 to 20 to 50, skipping many elements in between without comparing them one by one. The time complexity of skip-list intersection is approximately O(log(n)), and it is a common algorithm in search engines.
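A minimal sketch of the skipping idea using simple skip pointers over a sorted array (a two-level scheme like the figure; a real skip list is a probabilistic multi-level linked structure, and the span k = sqrt(n) is just a common choice):

import math

def skip_intersect(long_list, short_list):
    k = int(math.sqrt(len(long_list))) or 1   # skip span (top-level spacing)
    result, i = [], 0
    for target in short_list:
        # leap k elements at a time while the jump cannot overshoot
        while i + k < len(long_list) and long_list[i + k] <= target:
            i += k
        # then walk one by one inside the final span
        while i < len(long_list) and long_list[i] < target:
            i += 1
        if i < len(long_list) and long_list[i] == target:
            result.append(target)
    return result

print(skip_intersect([1, 2, 3, 4, 20, 21, 22, 23, 50, 60, 70], [50, 70]))   # [50, 70]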

Brief summary:

(1) A whole-network search engine consists of three subsystems: spider, search&index, and rank;

(2) An in-site search engine differs from a whole-network search engine only in that it needs no spider subsystem;

(3) spider and search&index are engineering-oriented systems, while the rank system needs long-term tuning and accumulation to optimize;

(4) The forward index is the structure for quickly finding a page's post-segmentation content list<item> by url_id;

(5) The inverted index is the structure for quickly finding the list<url_id> of pages containing a given word item;

(6) A user search proceeds by segmenting the query, fetching the list<url_id> for each item, and finally intersecting the sets;

(7) The methods for finding the intersection of ordered sets are:

  • Nested for loops, time complexity O(n*n)

  • Zipper method, time complexity O(n)

  • Horizontal bucketing with multi-threaded parallelism

  • bitmap, which greatly improves the parallelism of the operation, time complexity O(n)

  • Skip list, time complexity O(log(n))

Voiceover: this should be enough for interviews.

Most engineers have never touched a "search kernel", yet almost every Internet business involves a "search" feature. Again take the post business of 58.com as an example: post titles and bodies carry strong user retrieval demands. How should retrieval be implemented at each stage as business, traffic, and concurrency grow?

Original stage: LIKE

In the startup stage, this approach is often used because it is quick to ship.

The data may be stored in the database like this:

t_tiezi(tid, title, content)

Title and content retrieval can be implemented with LIKE (keyword here stands for the user's search term):

select tid from t_tiezi where content like '%keyword%'

This method does meet the business need quickly, but its problems are just as obvious:

(1) Low efficiency: every query is a full table scan with heavy computation, and under high concurrency the CPU easily hits 100%;

(2) Word segmentation is not supported;

Primary stage: full-text index

How can efficiency be improved quickly, word segmentation be supported, and the impact on the existing architecture be kept minimal? The first idea that comes to mind is a full-text index:

alter table t_tiezi add fulltext(title,content)

Queries against the indexed fields are then written with MATCH ... AGAINST, e.g. (keyword again standing for the search term):

select tid from t_tiezi where match(title, content) against('keyword')

A full-text index quickly meets the business's word-segmentation needs and quickly improves performance (with terms inverted after segmentation, at least full table scans are gone), but some problems remain:

(1) It only applies to MyISAM;

(2) Because the full-text index rides on the database, search traffic and ordinary CRUD traffic are coupled in one system: when search concurrency is high it can hurt CRUD requests, and when CRUD concurrency is high, search becomes very slow;

(3) Once the data reaches millions of rows, performance still degrades noticeably and query latency grows too long for the business to accept;

(4) It is hard to scale horizontally;

Intermediate stage: open-source external index

To overcome the limits of database full-text search, once the data grows to the millions or tens of millions, an external index has to be considered. The core idea of an external index is to separate the index data from the original data: the former serves search, the latter serves CRUD, and a mechanism (double-write, notification, or periodic rebuild) keeps the two consistent.

The original data can stay in MySQL; how is the external index implemented? Solr, Lucene, and ES are all common open-source options, with ES (ElasticSearch) currently the most popular.

Lucene is good, but its potential drawbacks are:

(1) Lucene is only a library; you must build your own service around it and implement complex features such as high availability, scalability, and load balancing yourself;

(2) Lucene only supports Java; to support other languages, you must again build your own service;

(3) Lucene is not user-friendly, which is fatal: it is complex, and users often need deep search expertise to understand how it works; to shield that complexity, yet again you must build your own service;

To remedy Lucene's shortcomings, the answer is to "wrap it in an interface-friendly service that shields the underlying complexity", and hence ES:

(1) ES is a service with Lucene at its core that implements search and exposes a RESTful interface;

(2) ES can store very large volumes of data and serve highly concurrent search requests;

(3) ES supports clustering and shields users from complex features such as high availability, scalability, and load balancing;
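For a flavor of the RESTful interface, here is a minimal sketch of a full-text query against a local ES cluster; the index name tiezi and the field content are hypothetical stand-ins for the post data:

import requests

resp = requests.post(
    "http://localhost:9200/tiezi/_search",               # ES search endpoint
    json={"query": {"match": {"content": "keyword"}}},   # full-text match on one field
)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_id"], hit["_score"])                     # matching docs, ranked by score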

Kuaigou Taxi currently uses ES as its core search service to meet its various business search needs, including:

(1) The largest-volume requirement, "interface latency data collection", with roughly 1 billion records;

(2) The most concurrent requirement, "latitude/longitude geographic search", with average online concurrency around 2,000 and stress-test concurrency around 8,000;

So ES can comfortably satisfy common search needs at the scale of 1 billion records and 5k throughput.

Advanced stage: self-developed search engine

When the data volume grows further, to billions or tens of billions of records, concurrency climbs to 100,000 queries per second, and business-specific needs keep multiplying, it is time to develop a search engine in-house and customize the search kernel.

At the stage of a customized, self-developed search engine, the design focus is extreme data volume and extreme concurrency. To approach "unlimited capacity and unlimited concurrency", the architecture must be designed around "scalability", striving for: add machines, and both data capacity and concurrency scale out.

The preliminary architecture diagram of 58.com's self-developed search engine E-search is as follows:

(1) The upper-layer proxy (pink) is the access cluster, the external entry point that accepts search requests. It is stateless, so the proxy cluster's performance can be scaled by adding machines;

(2) The middle-layer merger (light blue) is the logic cluster, which mainly fans searches out, merges results, and performs scoring and sorting. The business-specific rank logic is implemented in this layer. It is also stateless, so the merger cluster's performance can likewise be scaled by adding machines;

(3) The bottom-layer searcher (the dark red boxes) is the retrieval cluster. The service and the index data are deployed on the same machine; on startup the index data is loaded into memory, and requests are served from memory, which is very fast:

  • For scalability of data capacity, the index data is horizontally sharded; adding shards scales capacity without limit. In the figure above, the searchers form 4 groups

  • For scalability of per-shard throughput, each shard of data is replicated; in theory, adding machines scales throughput without limit. In the figure above, each searcher group has 2 replicas

With this design, adding machines really does mean carrying more data and serving higher concurrency.
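A minimal sketch of the routing this implies: hash each url_id to a shard group (capacity scaling), and have every query touch one randomly chosen replica per group (throughput scaling); the group and replica counts are the illustrative ones above:

import random

NUM_GROUPS, NUM_REPLICAS = 4, 2
searchers = [[f"searcher-{g}-{r}" for r in range(NUM_REPLICAS)]
             for g in range(NUM_GROUPS)]

def group_of(url_id):
    return url_id % NUM_GROUPS                 # which shard group stores this doc

def plan_query():
    # a query must cover every group, but only needs one replica per group
    return [random.choice(group) for group in searchers]

print(group_of(1234567))   # 3 -- the group holding url_id 1234567
print(plan_query())        # e.g. ['searcher-0-1', 'searcher-1-0', ...]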

Brief summary:

In order to meet the needs of the search business, as the amount of data and concurrency grows, the search architecture generally goes through several stages:

(1) Original stage: LIKE;

(2) Primary stage: full-text index;

(3) Intermediate stage: open-source external index;

(4) Advanced stage: self-developed search engine;

One last advanced topic, on the real-time nature of search:

Why can Baidu retrieve, in real time, news that appeared 15 minutes ago? Why can 58.com retrieve, in real time, posts published 1 second ago?

What are the main points of the real-time search engine system architecture?

For a search engine to stay real-time under huge data volume and high concurrency, two points in the architecture design are key:

(1) Index tiering;

(2) dump&merge;

First, when the data volume is very large, no data update modifies the index in place in real time, so that inverted-index retrieval stays efficient.

Voiceover: Because once fragments are generated, retrieval efficiency will be greatly reduced.

Since the index data cannot be modified in real time, how can the newest web pages still be retrieved?

Index tiering: split the index into a full index, a daily incremental index (the "day library"), and an hourly incremental index (the "hour library").

As described in the figure above:

(1) The 30 billion items of data are in the full index;

(2) The 10 million items modified within the last day are in the day library;

(3) The 500,000 items modified within the last hour are in the hour library;

When a modification request arrives, only the lowest tier is touched, e.g. the hour library.

When a query request arrives, all tiers are queried at the same time and the results merged, so the freshest data is obtained:

(1) The full index is compactly stored with no fragmentation, and is fast;

(2) The day library is compactly stored and fast;

(3) The hour library holds little data and is fast;

Tiered indexing guarantees freshness. Then a new question arises: when does hour-library data get reflected in the day library, and when does day-library data get reflected in the full index?

dump&merge, index export and merge, performed by two asynchronous tools:

dumper: exports the online data.

merger: merges the exported offline data into the next tier up.

The hour library is merged into the day library once an hour;

The day library is merged into the full index once a day;

This keeps the hour library and the day library from growing particularly large;

If the data volume and concurrency are even greater, week and month libraries can be added as further buffers.
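A minimal sketch of the tiered read path: query every tier and let fresher tiers win when the same url_id appears in several; the tier contents are toy stand-ins for real inverted-index lookups:

def tiered_search(item, hour_db, day_db, full_db):
    # each tier maps item -> {url_id: doc}; apply oldest first so newer
    # tiers overwrite stale entries for the same url_id
    merged = {}
    for tier in (full_db, day_db, hour_db):
        merged.update(tier.get(item, {}))
    return merged

full_db = {"rent": {101: "old post", 102: "old post"}}
day_db = {"rent": {102: "edited today"}}
hour_db = {"rent": {103: "posted this hour"}}

print(tiered_search("rent", hour_db, day_db, full_db))
# {101: 'old post', 102: 'edited today', 103: 'posted this hour'}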

Brief summary:

For huge data volume and extreme concurrency, the two architectural keys to a real-time search engine are:

(1) Index tiering;

(2) dump&merge;

On "search" and "retrieval", did you GET new skills?
