How Love Shanghai judges duplicate web pages

E, compute an additional signature for each cluster of web pages;

B, filter the blocks of the web page to obtain the content blocks that contain the web page text;

C, extract one or more sentences from the web page text, and compute the text sentence signature from those sentences;

The basic architecture for judging duplicate web page content

In an era of highly developed science and technology, Love Shanghai has become the main way people access information. But its results are now full of duplicate content, which causes users considerable trouble. Love Shanghai therefore needs to judge which web pages are duplicates and remove them, keeping only high-quality pages for users to browse. However, existing techniques generally compare the full content of two pages to determine how similar they are.

In the first step, numeric information, copyright information, and other information that plays no decisive role in judging duplicates is filtered out of the sentence. The sentence is then converted, for example by full-width/half-width conversion or traditional/simplified conversion, so that all converted sentences share a uniform format.
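The filtering and conversion step above can be sketched roughly as follows. This is only an illustration, not the method the text describes in detail: the function name `normalize_sentence` is hypothetical, Unicode NFKC normalization is used here as one convenient way to fold full-width characters to half-width, and copyright filtering and traditional/simplified conversion are left out because they need patterns or mapping tables the text does not specify.

```python
import re
import unicodedata


def normalize_sentence(sentence):
    """Return a filtered, uniformly formatted copy of one sentence (illustrative)."""
    # NFKC folds full-width letters, digits, punctuation, and the
    # ideographic space into their half-width equivalents.
    text = unicodedata.normalize("NFKC", sentence)
    # Drop digits, which the text treats as not decisive for duplicate judgment.
    text = re.sub(r"\d+", "", text)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


print(normalize_sentence("Ｈｅｌｌｏ　２０２４ world 42"))
```

After this step, two sentences that differ only in character width or in embedded numbers compare as equal, which is what makes the later signature comparison meaningful.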

in this step, the filter >

The method of

A, divide the web page into blocks;

B, filter and convert the web page text after it has been split into sentences;

F, judge, according to the additional signatures, whether the web pages in each cluster are duplicates.

A, split the web page text into sentences;



A, obtain multiple web pages;

D, cluster the multiple web pages according to their text sentence signatures;

With this duplicate-judgment system and method, multidimensional signatures, including the web page text sentence signature, make it possible to determine effectively and quickly whether a page is a duplicate.
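The two-stage idea of clustering by sentence signature and then confirming with an additional signature might be sketched as below. The text does not say what the additional signature is, so both signature functions are passed in as parameters; the name `find_duplicates` and the toy signatures in the usage lines are assumptions made only for illustration.

```python
from collections import defaultdict


def find_duplicates(pages, sentence_sig, extra_sig):
    """pages maps URL -> page text; returns groups of URLs judged duplicates."""
    # Cluster pages whose sentence signatures are equal.
    clusters = defaultdict(list)
    for url, text in pages.items():
        clusters[sentence_sig(text)].append(url)

    duplicates = []
    for urls in clusters.values():
        # Within one cluster, regroup by the additional signature and
        # report only groups that contain more than one page.
        confirmed = defaultdict(list)
        for url in urls:
            confirmed[extra_sig(pages[url])].append(url)
        duplicates.extend(g for g in confirmed.values() if len(g) > 1)
    return duplicates


# Toy signatures: longest sentence as the first key, word count as the second.
longest = lambda t: max(t.split("."), key=len).strip()
word_count = lambda t: len(t.split())
pages = {
    "a": "Hi. This is the long shared sentence",
    "b": "Hi. This is the long shared sentence",
    "d": "Hello hello. This is the long shared sentence",
}
print(find_duplicates(pages, longest, word_count))
```

The cheap first key keeps each cluster small, so the second check only runs over pages that already look alike.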

web page

C, extract the web page text from the content blocks.

B, extract the web page text from each of the web pages;

Sentence splitting

In this step, the web page text is split into sentences at semicolons, periods, exclamation marks, and other sentence-ending symbols. In addition, the visual layout information of the web page can also be used to split the web page text into sentences.
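Punctuation-based splitting of this kind can be sketched with a regular expression. The set of delimiters below (ASCII and full-width forms, plus question marks) and the name `split_sentences` are assumptions; splitting by visual layout is not covered.

```python
import re

# Sentence-ending marks in ASCII and full-width CJK forms (an assumed set).
_SENTENCE_END = re.compile(r"[;.!?；。！？]+")


def split_sentences(text):
    """Split text at sentence-ending marks and drop empty pieces."""
    parts = _SENTENCE_END.split(text)
    return [p.strip() for p in parts if p.strip()]


print(split_sentences("First part; second part. Third!"))
```

A plain split on periods will also break decimal numbers and abbreviations, so a production splitter would need extra rules for those cases.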

C, extract the one or more longest sentences from the filtered and converted web page text.
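Selecting the longest sentences and turning them into a signature could look like the sketch below. The text only says that a signature is computed from the extracted sentences; the choice of SHA-256 digests truncated to 16 hex characters, the default of three sentences, and the name `sentence_signature` are all assumptions.

```python
import hashlib


def sentence_signature(sentences, k=3):
    """Return a set of short digests for the k longest sentences (illustrative)."""
    # Keep only the k longest sentences of the page.
    longest = sorted(sentences, key=len, reverse=True)[:k]
    # Hash each sentence separately so the signature is a set of digests.
    return frozenset(
        hashlib.sha256(s.encode("utf-8")).hexdigest()[:16] for s in longest
    )


page_a = ["short", "a much longer sentence here", "medium length one"]
page_b = ["tiny", "a much longer sentence here", "medium length one"]
print(sentence_signature(page_a, k=2) == sentence_signature(page_b, k=2))
```

Because only the longest sentences contribute, two pages that differ in short boilerplate lines still receive the same signature.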

Comparing the full content of two pages gives an accurate result, but its time complexity is too high and the calculation is time-consuming. Computing a signature from some of the important information in each page and then comparing the two signatures to calculate similarity is simple, efficient, and much faster, which makes it better suited to a scenario like Love Shanghai's, where vast amounts of information must be handled.
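As an illustration of comparing signatures rather than full content, the sketch below scores two pages by the overlap of their signature sets. The text does not name a similarity measure; Jaccard similarity is swapped in here as one common choice, and `signature_similarity` is a hypothetical name.

```python
def signature_similarity(sig_a, sig_b):
    """Jaccard similarity between two collections of sentence digests."""
    a, b = set(sig_a), set(sig_b)
    if not a and not b:
        # Two empty signatures are treated as identical.
        return 1.0
    return len(a & b) / len(a | b)


print(signature_similarity({"x", "y", "z"}, {"y", "z", "w"}))
```

The cost depends only on the number of digests per page, not on the length of the pages themselves, which is the efficiency gain the paragraph above refers to.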

Text extraction

