In practice, search engines do not only segment a page into meaningful words. They also extract keywords by continuous cutting and compute fingerprints from the result. Continuous cutting means sliding forward one character at a time and cutting at each position. For example, the sentence "Baidu began to crack down on the buying and selling of links" is cut, in the original Chinese, into overlapping two-character fragments, each starting one character after the previous one. The search engine then picks some of these fragments as keywords, computes a fingerprint from them, and compares fingerprints to decide whether the content is a duplicate. This is only the basic algorithm search engines use to recognize duplicate pages; there are many other algorithms for dealing with them.
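The continuous cutting and fingerprint comparison described above can be sketched roughly as follows. This is a minimal illustration, not the algorithm any particular search engine uses: the fragment length of 2, the choice of MD5, the rule of keeping the smallest hashes as the "extracted" keywords, and the function names `shingles`, `fingerprint`, and `overlap` are all assumptions made for the example.

```python
import hashlib


def shingles(text, size=2):
    """Cut text into overlapping fragments, sliding one character at a time."""
    chars = [c for c in text if not c.isspace()]  # ignore whitespace
    return ["".join(chars[i:i + size]) for i in range(len(chars) - size + 1)]


def fingerprint(text, size=2, keep=4):
    """Hash every fragment and keep the `keep` smallest hashes.

    Keeping the smallest hashes is an assumed selection rule; it simply
    gives a deterministic subset of fragments to compare between pages.
    """
    hashes = {
        hashlib.md5(fragment.encode("utf-8")).hexdigest()
        for fragment in shingles(text, size)
    }
    return frozenset(sorted(hashes)[:keep])


def overlap(fp_a, fp_b):
    """Share of fingerprint hashes two pages have in common (0.0 to 1.0)."""
    if not fp_a or not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)
```

Two pages with the same text yield the same fingerprint and an overlap of 1.0. Where the duplicate threshold sits below that is a tuning decision this sketch leaves open.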
With the Internet as developed as it is today, the same information is published on many sites, and the same news story is reported by most media sites. Add to that the small sites and SEO practitioners who diligently scrape content from across the web, and the result is a large amount of duplicate information online. When a user searches for a keyword, however, the search engine does not want to show results that all contain the same content. Crawling these duplicate pages is also, to some extent, a waste of the search engine's own resources. Removing sites with duplicate content has therefore become a major problem for search engines.
De-duplication is generally carried out after word segmentation and before indexing (though it may also happen before segmentation). From the keywords already separated out of the page, the search engine extracts a representative subset and then computes a "fingerprint" of those keywords. Every page has such a characteristic fingerprint. When the keyword fingerprint of a newly crawled page coincides with that of a page already in the index, the search engine may treat the new page as duplicate content and decline to index it.
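As a rough sketch of this step, the snippet below takes a page's already-segmented keywords, picks the most frequent ones as the representative set, hashes them into a fingerprint, and refuses to index a page whose fingerprint has been seen before. The frequency-based selection, the `top_n` value, SHA-1, and the names `keyword_fingerprint` and `FingerprintIndex` are assumptions for illustration; the source does not say how real search engines choose representative keywords.

```python
import hashlib
from collections import Counter


def keyword_fingerprint(keywords, top_n=5):
    """Fingerprint a page from its most frequent keywords (assumed rule)."""
    counts = Counter(keywords)
    # Sort by frequency, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    representative = sorted(word for word, _ in ranked[:top_n])
    joined = "\x00".join(representative)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


class FingerprintIndex:
    """Remembers fingerprints of indexed pages and rejects repeats."""

    def __init__(self):
        self._seen = {}  # fingerprint -> URL of the first page that had it

    def add_if_new(self, url, keywords):
        """Return True if the page was indexed, False if it is a duplicate."""
        fp = keyword_fingerprint(keywords)
        if fp in self._seen:
            return False
        self._seen[fp] = url
        return True
```

An exact-match lookup like this only catches pages whose representative keywords are identical; detecting near-duplicates would need a similarity measure rather than a single hash.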
This is why most of the pseudo-original (article spinning) tools popular on the web either fail to deceive search engines or produce content so garbled that nobody can make sense of it. In theory, then, content produced with common pseudo-original tools cannot get indexed and ranked normally. However, Baidu does not simply discard every duplicate page without indexing it. It relaxes the indexing standard to some degree according to the weight of the site hosting the duplicate page, and this allows some cheaters to benefit.
Note: this material comes from the book "SEO Deep Analysis" by Pizi Rui. Thanks to the author for writing up such useful knowledge for us SEO practitioners.
In a typical search engine architecture, duplicate-page handling already exists in the Spider's crawling stage. The earlier the de-duplication step runs in the overall architecture, the more processing resources are saved in the later stages. Search engines generally classify the duplicate pages they have crawled. For example, they determine whether a site contains a large number of duplicate pages, or whether a site consists entirely of content scraped from other sites, and use that to decide how the site will be crawled in future or whether to block crawling of it outright.
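The site-level decision described above might look something like the following sketch. The thresholds `reduce_at` and `block_at`, the three policy labels, and the function name `crawl_policy` are purely hypothetical; the source does not state how any search engine sets these boundaries.

```python
def crawl_policy(duplicate_pages, total_pages, reduce_at=0.5, block_at=0.9):
    """Pick a crawl policy for a site from its share of duplicate pages.

    The thresholds are illustrative assumptions, not known values.
    """
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    ratio = duplicate_pages / total_pages
    if ratio >= block_at:
        return "block"    # site is almost entirely scraped content
    if ratio >= reduce_at:
        return "reduce"   # many duplicates: crawl less often
    return "normal"
```

For instance, under these assumed thresholds a site where 6 of 10 crawled pages are duplicates would be crawled less often, while a site made up entirely of duplicates would be blocked.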