Tuesday, May 03, 2005

[r] zhai05, "Web Data Extraction Based on Partial Tree Alignment"

@inproceedings{1060761,
author = {Yanhong Zhai and Bing Liu},
title = {Web data extraction based on partial tree alignment},
booktitle = {WWW '05: Proceedings of the 14th international conference on World Wide Web},
year = {2005},
isbn = {1-59593-046-9},
pages = {76--85},
location = {Chiba, Japan},
doi = {http://doi.acm.org/10.1145/1060745.1060761},
publisher = {ACM Press},
address = {New York, NY, USA},
}
Motivation:
To effectively (1) find data records in a web page and (2) align the data fields across multiple data records.
Contributions:
Built the corresponding system, DEPTA (or MDR-2);
Used visual cues to improve the accuracy of the data regions found;
Proposed an algorithm for data field alignment based on partial tree alignment.
Methods:
Visual cues obtained from browser rendering, which improve accuracy and robustness;
Partial tree alignment.
Discussion:
This can be regarded as an alignment paper. The idea of using a seed tree as the matching baseline and growing it is simple yet quite neat. This is also an active research area.
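
To make the seed-growing idea concrete, here is a rough sketch of it in Python. It is not the paper's implementation: it assumes each data record has been flattened to a list of field labels (one tree level only), and it uses a plain longest-common-subsequence match where the paper matches trees. The function names (lcs_pairs, try_merge, partial_align) are made up for this note. A field is inserted into the seed only when its position is unambiguous, i.e. its matched neighbours are adjacent in the seed; records with leftover fields are retried after the seed has grown.

```python
from typing import List, Tuple


def lcs_pairs(seed: List[str], rec: List[str]) -> List[Tuple[int, int]]:
    """Return matched (seed_index, rec_index) pairs of a longest common subsequence."""
    n, m = len(seed), len(rec)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if seed[i] == rec[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    pairs = []
    i = j = 0
    while i < n and j < m:
        if seed[i] == rec[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def try_merge(seed: List[str], rec: List[str]) -> Tuple[List[str], bool]:
    """Insert rec's unmatched fields into seed where the position is unambiguous.

    Returns the (possibly grown) seed and a flag that is True only if every
    field of rec is now matched or inserted.
    """
    # Sentinel pairs mark the positions just before and just after both lists.
    bounds = [(-1, -1)] + lcs_pairs(seed, rec) + [(len(seed), len(rec))]
    inserts = []
    complete = True
    for (s0, r0), (s1, r1) in zip(bounds, bounds[1:]):
        run = rec[r0 + 1:r1]  # unmatched fields between two matches
        if not run:
            continue
        if s1 - s0 == 1:
            inserts.append((s1, run))  # neighbours adjacent in seed: unique slot
        else:
            complete = False  # seed has its own fields in this gap: ambiguous
    merged = list(seed)
    for pos, run in reversed(inserts):  # right to left keeps indices valid
        merged[pos:pos] = run
    return merged, complete


def partial_align(records: List[List[str]]) -> List[str]:
    """Grow a seed from the longest record; assumes records is non-empty."""
    seed_idx = max(range(len(records)), key=lambda k: len(records[k]))
    seed = list(records[seed_idx])
    pending = [r for k, r in enumerate(records) if k != seed_idx]
    while pending:
        deferred, grew = [], False
        for rec in pending:
            merged, complete = try_merge(seed, rec)
            grew = grew or len(merged) > len(seed)
            seed = merged
            if not complete:
                deferred.append(rec)
        if not grew:
            break  # no progress; the remaining fields stay unaligned
        pending = deferred
    return seed
```

For example, with the records [a, b, e, f], [a, c, e] and [b, c, e], the field c in the second record cannot be placed at first (it could go before or after b), so that record is deferred. The third record puts c between b and e, the seed grows, and the second record then aligns fully on the retry.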