As the Internet continues to develop, the quantity of information it holds grows day by day, and in recent years this growth has become explosive. According to an IDC report, the quantity is predicted to increase by 57% annually from now until 2011, reaching 988 EB (1 EB = 1 billion GB) by 2010, which is 6 times the 2006 figure and equal to 18 million times the information contained in all books.
Faced with such a huge Internet database, how to search for all the information related to a specific theme quickly, effectively, and economically has become a hot research question. As the information on the Internet increases rapidly and the set of websites covered by search engines keeps expanding, people find that even when they turn to search engines for help, it is increasingly difficult to locate the information sources they need.
After studying the solutions used by existing search engines, this paper extracts the main content of Chinese web pages with a method based on web page statistics. The method first represents a web page as an XML-based DOM tree, then uses statistical information gathered from the tree nodes to filter out noise nodes, and finally selects the content nodes. Compared with the traditional wrapper-based approach, this method is simpler and more practical.
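The three steps above (build a tree, compute node statistics, filter noise) can be illustrated with a minimal sketch. It is not the paper's implementation: the choice of statistics (text length and link density), the thresholds `max_link_density` and `min_chars`, the set `CONTAINER_TAGS`, and all function names are assumptions made for illustration, and Python's standard `html.parser` stands in for the XML-based DOM conversion.

```python
from html.parser import HTMLParser

# Hypothetical choices for this sketch; the paper does not specify them.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}
SKIP_TAGS = {"script", "style"}
CONTAINER_TAGS = {"div", "td", "ul", "ol", "table"}


class Node:
    """One node of the simplified DOM tree; text lives in '#text' nodes."""

    def __init__(self, tag, parent=None, text=""):
        self.tag = tag
        self.parent = parent
        self.text = text
        self.children = []


class TreeBuilder(HTMLParser):
    """Step 1: turn the page into a tree of Node objects."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.children.append(Node("#text", self.current, text))


def stats(node, in_link=False):
    """Step 2: return (text_chars, link_chars) for the subtree at node."""
    if node.tag in SKIP_TAGS:
        return 0, 0
    if node.tag == "#text":
        n = len(node.text)
        return n, (n if in_link else 0)
    in_link = in_link or node.tag == "a"
    text = link = 0
    for child in node.children:
        t, l = stats(child, in_link)
        text += t
        link += l
    return text, link


def extract_content(html, max_link_density=0.5, min_chars=20):
    """Step 3: drop container nodes that look like noise, keep the rest."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    kept = []

    def walk(node):
        if node.tag in SKIP_TAGS:
            return
        if node.tag in CONTAINER_TAGS:
            text, link = stats(node)
            # Noise: almost no text, or text that is mostly link anchors.
            if text == 0 or text < min_chars or link / text > max_link_density:
                return
        if node.tag == "#text":
            kept.append(node.text)
        for child in node.children:
            walk(child)

    walk(builder.root)
    return " ".join(kept)


sample = """
<html><body>
<div id="nav"><a href="/">Home</a> <a href="/news">News</a> <a href="/about">About us</a></div>
<div id="main"><p>The crawler downloads each page and converts it to a tree.</p>
<p>Statistics on every node then separate content from <a href="/x">noise</a>.</p></div>
<div id="footer">Copyright</div>
<script>var x = 1;</script>
</body></html>
"""
print(extract_content(sample))
```

Because the statistics are plain character counts, the same sketch applies unchanged to Chinese text. Unlike a wrapper, it needs no per-site template, although the thresholds would have to be tuned on real pages.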
Experimental results show that the method achieves an accuracy of over 90% and has good practical value.
Translated by hand; apologies for any shortcomings.