Thesis title: Web Page Information Filtering Method Research Based on Vector Space Model (基于向量空间模型的网页信息过滤方法研究)
Advisor: Song Mingqiu (宋明秋)
Degree: Master's
Major: E-commerce and Logistics Management
Institution: Dalian University of Technology
Pages: 67
File size: 3405K
Source: 论文网, 2007-11-13, http://www.lw23.com/lunwen_4697772/
Keywords: Content Security; Information Filtering; Web Page Content Extraction; Vector Space Model

The development of the Internet has driven the development and transformation of society as a whole, and the rise of electronic commerce has changed the way people live and made daily life far more convenient. With the rapid growth of electronic commerce, however, security problems have become increasingly prominent. Illegal sites such as phishing sites, together with the spread of superstitious, pornographic, violent, and anti-government information, seriously threaten the content security of the electronic commerce environment. Filtering harmful network information is therefore essential to keeping that environment secure, healthy, and harmonious. Traditional filtering techniques based on keywords and URLs can no longer solve these problems effectively.
This thesis surveys the current state of content security technology and applies information filtering based on content analysis to the protection of Internet content security. It studies the key techniques of Chinese word segmentation, text representation, and feature extraction in information filtering. For feature term weighting, it analyzes how HTML tags affect weight computation and, building on an improved version of the traditional TFIDF method, proposes a weighting method based on HTML tag weights.

To improve the accuracy of the web page information filtering system, the thesis also studies the extraction of main content from web pages. After analyzing the layout of Chinese web pages and the distribution of Chinese punctuation marks within them, it proposes a new content extraction method that identifies main content from two features: the number of Chinese punctuation marks, and the ratio of the number of non-hyperlink characters to the number of characters contained in hyperlinks. Experimental results show that the method is accurate and applies to most web sites.

Finally, the thesis proposes and implements a new filtering scheme that uses a two-level strategy combining URL-based filtering with content filtering. Content filtering runs only when the requested URL is in neither the white list nor the black list, and the URL lists are updated according to the content filtering result. The scheme thereby combines the real-time efficiency of URL filtering with the comprehensiveness of content filtering. The system captures HTTP packets using Winsock 2 SPI, extracts main content with the newly proposed method, and represents text with the vector space model. Experimental results show that the system achieves good filtering accuracy and performance.
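
The content extraction idea described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the `Block` type, the punctuation set `CN_PUNCT`, and the thresholds `min_punct` and `min_ratio` are all assumptions made for illustration, and the page is assumed to be already split into blocks with hyperlink text separated from other text.

```python
from dataclasses import dataclass

# Assumed punctuation set; the abstract does not list the marks it counts.
CN_PUNCT = set(",。;:!?、")


@dataclass
class Block:
    """A hypothetical page block, already split into link and non-link text."""
    plain_text: str  # characters outside hyperlinks
    link_text: str   # characters inside hyperlinks


def punct_count(block: Block) -> int:
    """Count Chinese punctuation marks in the non-hyperlink text."""
    return sum(ch in CN_PUNCT for ch in block.plain_text)


def text_link_ratio(block: Block) -> float:
    """Ratio of non-hyperlink characters to hyperlink characters.

    Adding 1 to the denominator avoids division by zero for link-free blocks.
    """
    return len(block.plain_text) / (len(block.link_text) + 1)


def is_main_content(block: Block, min_punct: int = 3, min_ratio: float = 2.0) -> bool:
    """Treat a block as main content if both features pass assumed thresholds."""
    return punct_count(block) >= min_punct and text_link_ratio(block) >= min_ratio


# A navigation bar is almost all link text; an article paragraph is not.
nav = Block(plain_text="", link_text="首页新闻体育财经")
body = Block(
    plain_text="互联网的发展带动了社会的变革,电子商务改变了人们的生活方式。但是,安全问题越来越突出。",
    link_text="更多",
)
print(is_main_content(nav), is_main_content(body))
```

Navigation bars and link lists tend to contain little punctuation and mostly link text, so both features fall below the thresholds for them, while running prose passes both.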
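
The two-level filtering strategy from the abstract can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the system described in the thesis: the class `TwoLevelFilter`, the `threshold` value, and the sample profile are hypothetical, pages are assumed to be already segmented into terms, and plain term frequencies stand in for the HTML-tag-weighted TFIDF weights the thesis proposes.

```python
import math
from collections import Counter


def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TwoLevelFilter:
    """Hypothetical two-level filter: URL lists first, content analysis second."""

    def __init__(self, profile: dict, threshold: float = 0.5):
        self.whitelist = set()
        self.blacklist = set()
        self.profile = profile      # term weights describing unwanted content
        self.threshold = threshold  # assumed similarity cutoff

    def allow(self, url: str, terms: list) -> bool:
        # Level 1: a URL already on a list is decided without content analysis.
        if url in self.whitelist:
            return True
        if url in self.blacklist:
            return False
        # Level 2: compare the page vector with the profile, then update the lists.
        if cosine(Counter(terms), self.profile) >= self.threshold:
            self.blacklist.add(url)
            return False
        self.whitelist.add(url)
        return True


flt = TwoLevelFilter(profile={"casino": 3, "lottery": 2})
first = flt.allow("http://bad.example/", ["casino", "lottery", "casino"])
second = flt.allow("http://ok.example/", ["weather", "news"])
# The second visit to the blocked URL is rejected by the black list alone.
third = flt.allow("http://bad.example/", ["weather"])
print(first, second, third)
```

Because each content decision is written back to the URL lists, repeat requests for the same URL are resolved at the first level, which is the source of the real-time behavior the abstract attributes to URL filtering.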