Xun's Studyroom: May 2005

Friday, May 20, 2005

[t] Java Review

I checked out the book "Sams Teach Yourself Java2 in 21 Days" for a Java review. The progress is fine. Till today I have finished:

Day8: Putting Interactive Programs on the Web.
Day 10: Adding Images, Animation and Sound.
Day 15: Class Roles: Packages, Interfaces and Other Features
Day 20: Designing a User Interface with Swing.
Day 21: Handling User Events with Swing.
Day 19: Java Beans and Other Advanced Features.
Day 12: Arranging Components on a User Interface.
Day 16: Exceptional Circumstances: Error Handling and Security.
Day 17: Handling Data Through Java Streams.
Day 18: Communicating Across the Internet.
Day 9: Making Programs Look Good with Graphics, Fonts and Color.
[... following studies will be put into here]

Good feeling to master something:).

Sunday, May 15, 2005

I tried to use java for mote signal plotting and it was successful. The plotting package I used is called Graph Class Library. Setting up Jbuilder took me quite some time, till now there are still problems to import tinyos packages.
Inspite of the cumbusomeness Jbuilder has, it is definitely a friendly and powerful development environment. The autocompletion and error checking are gorgeous.
No matter what language you are using: VC, VB or Java (could be python:P), the backstage schemes are the same: 1) prepare a buffer to store the plotting data, 2) update the buffer tail to make reflexion of time, 3) replot.
All three steps have the opportunity of improvment, in case signal comes in a high frequency. for step 1, a ring buffer or dynamic array can be used to avoid moving data in step 2; for step 2, update can be done in batch mode, or on a fixed inteval basis; for step 3, replotting can be done on part of the whole window area, or other window clipping techniques can be used.

[edit on 05/16]
The moteiv people keep warning a clean cygwin installation is a must: that's of reason. I found out that the tinyos java package path problem is actually caused by the cygwin installation. After doing a complete reinstall, problems solved.

Monday, May 09, 2005

[t] Usage of MSChart Control

MSChart control is a very useful activeX control which can fullfix virtually all excel style plottings. Prepare a datagrid and assign it the the control's ChartData member, then call control's Update() function, that's it.

One thing to note is that the first row and first column of the data grid is reserved to keep the heading data. So useful grids are actually from grid(1,1).

Tuesday, May 03, 2005

[r] liu99, "Mining Association Rules with Multiple Minimum Supports"

@inproceedings{312274,
author = {Bing Liu and Wynne Hsu and Yiming Ma},
title = {Mining association rules with multiple minimum supports},
booktitle = {KDD '99: Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining},
year = {1999},
isbn = {1-58113-143-7},
pages = {337--341},
location = {San Diego, California, United States},
doi = {http://doi.acm.org/10.1145/312129.312274},
publisher = {ACM Press},
address = {New York, NY, USA},
}

Motivation:
To address the large set finding problem caused by treating all item as of similar frequency.
Contributions:
1) Systematically stated why applying a universal minimum support value is problematic.
2) Raised algorithm MSapriori to address this problem.
Methods:
Followed apriori, but for level 2 sets are generated differently, additionally, the pruning is more restricted. ( I think the key equation cross the whole paper is : if c(1) in s or c(2)=c(1)..... ).
Discussions:
MSapriori is a neat algorithm. Also, when apply this on table datasets, the form is a bit different.

[r] zhai05, "Web Data Extraction Based on Partial Tree Alignment"

@inproceedings{1060761,
author = {Yanhong Zhai and Bing Liu},
title = {Web data extraction based on partial tree alignment},
booktitle = {WWW '05: Proceedings of the 14th international conference on World Wide Web},
year = {2005},
isbn = {1-59593-046-9},
pages = {76--85},
location = {Chiba, Japan},
doi = {http://doi.acm.org/10.1145/1060745.1060761},
publisher = {ACM Press},
address = {New York, NY, USA},
}

Motivation:
To effectively: 1) find data records in a webpage 2) align the data fields accross multiple data records.
Contributions:
Bulit the corresponding system DEPTA (or MDR-2);
Use visual cues to improve accuracy of the found data regions;
Proposed an algorithm for data field alignment based on partial tree alignment.
Methods:
Used visual cues got from browser rendering, the advantage is can improve accuracy and robustness;
Partial tree alignment.
Discussion:
The can be regarded as an alignment paper. The idea of using a seed tree as matching baseline and grow it is simple yet quite neat. Also this is an active research that is going on.

Monday, May 02, 2005

[r] zhang96, "Birch: An Efficient Data Clustering Method for Very Large Databases"

@inproceedings{ zhang96birch,
  author = "Tian Zhang and Raghu Ramakrishnan and Miron Livny",
  title = "{BIRCH}: an efficient data clustering method for very large databases",
  pages = "103--114",
  year = "1996",
  url = "citeseer.ist.psu.edu/zhang96birch.html" }

Motivation:
Clustering in large databases is very expensive. The reason is due to excessive database scans. This paper addresses this problem.
Contributions:
Proposed the idea of "local clustering" and proved that introduction of "local clustering" didn't affect correctness. This idea is quite like B-Tree in database field, so every cluster caculation and update is done locally.
Method:
Proved that Cluster Feature (CF) had some interesting properties; Introduced the algorithms to deal with CFs.
Discussion:
The idea in this paper is influential. Recently, J. Han's group propsed moving objects clustering based on "microclusters". Their ideas are largely borrowed from Birch.

[r] fagin03, "Search the Workplace Web"

@misc{ fagin03searching,
 author = "R. Fagin and R. Kumar and K. McCurley and J. Novak and D. Sivakumar and
   J. Tomlin and D. Williamson",
 title = "Searching the workplace web",
 text = "Ronald Fagin, Ravi Kumar, Kevin S. McCurley, Jasmine Novak, D. Sivakumar,
   John A. Tomlin, and David P. Williamson. Searching the workplace web. In
   Proceedings of the Twelfth International World Wide Web Conference, Budapest,
   2003.",
 year = "2003",
 url = "citeseer.ist.psu.edu/fagin03searching.html" }

Motivation:
To clearly characterize intranet searching problem and find proper approaches.
Contributions:
Authors stated a few key differences between intranet searching and internet searching; and proposed to use an aggregation of searching strategies to find the most accurate result.
Methods:
Searching strategy aggregation. Details in the paper with straight algorithms.
Discussion:
The most contribution this paper has is pointing out the difference between intranet and internet search.

[r] han00, "Mining Frequent Patterns Without Candidate Generation"

@inproceedings{ han00mining,
  author = "Jiawei Han and Jian Pei and Yiwen Yin",
  title = "Mining frequent patterns without candidate generation",
  booktitle = "2000 ACM SIGMOD Intl. Conference on Management of Data",
  month = "05",
  publisher = "ACM Press",
  editor = "Weidong Chen and Jeffrey Naughton and Philip A. Bernstein",
  isbn = "1-58113-218-2",
  pages = "1--12",
  year = "2000",
  url = "citeseer.ist.psu.edu/han99mining.html",
  url = "citeseer.nj.nec.com/han99mining.html" }

Motivation:
To address apriori-like algorithms' combinational limitation.
Contributions:
Proposed data structure FP-tree, which is a compact representation of the database, without loss of frequent pattern information; also proposed FP-Growth algorithm to mine frequent patterns on FP-tree.
Methods:
FP-tree make database compact by prefix sharing.
FP-grows use a divide-and-conquer technique to divide fp-mining at each stage k into 1-item mining and k-1 item mining.
Discussions:
Though experiment result in paper showed that FP-tree and FP-growth greatly improved storage and computation efficient. There was no in-depth complexity analysis in the paper.

[r] agrawal95, "Mining Sequential Patterns"

@inproceedings{ agrawal95mining,
   author = "Rakesh Agrawal and Ramakrishnan Srikant",
   title = "Mining sequential patterns",
   booktitle = "Eleventh International Conference on Data Engineering",
   publisher = "IEEE Computer Society Press",
   address = "Taipei, Taiwan",
   editor = "Philip S. Yu and Arbee S. P. Chen",
   pages = "3--14",
   year = "1995",
   url = "citeseer.ist.psu.edu/agrawal95mining.html" }

Motivation:
To mine patterns of sequences.
Contributions:
Clearly defined the sequence mining problem. Defined support in this problem. Gave the whole process for sequence mining and 3 algorithms for the sequencing phase.
Methods:
Sequence mining can be done in 5 phases: sort, large itemset, transformation, sequence and maximal. Details can be found in paper, pay attention to the large itemset finding part and don't need to spend much on the 3 sequencing algorithms.
Discussions:
This algorithm can find all maximal frequent sequences. However, the apriori nature makes its performance low. More recently, J.Han et al did research on sequence mining performance.

[r] liu05, "Semi-Supervised Learning"

"Semi-Supervised Learning", Professor Bing Liu's instruction slides, UIC CS583 course, Spring 2005.
Motivation:
To address the classification problem with only positive labeled and unlabeled items provided.
Contribution:
1). Argued that unlabeled data is useful in increase classifier accuracy (this is certified by Nigam, et al in 2000, Prof. Liu extended the idea); 2). Theoretically proved that EM is usable in text mining; with methods of Spy-EM, Naive Baysian and SVM. 3). Proposed a new metric to evaluate classifier performance under the situation that F-Score can not be estimated.
Methods:
1). Spy-EM find reliable negative items in the M-stage. 2). r-square/Pr(f(x)=1) is used, where r can be estimated from the positive cases in validation set and Pr(f(x)=1) can be estimated from the whole validation set.
Discussions:
The concept of EM is not hard. The key point is to find a way for maximization in the M-stage. The author's main contribution is to define maximization problem to be find reliable negative cases under the constraint of keping r shreshold, and proved that r-squre/Pr(f(x)) =1 is a proper metric for classifier performance.

[r] Bayardo98, "Efficient Mining Long Patterns from Databases"

@inproceedings{276313,
author = {Roberto J. Bayardo, Jr.},
title = {Efficiently mining long patterns from databases},
booktitle = {SIGMOD '98: Proceedings of the 1998 ACM SIGMOD international conference on Management of data},
year = {1998},
isbn = {0-89791-995-5},
pages = {85--93},
location = {Seattle, Washington, United States},
doi = {http://doi.acm.org/10.1145/276304.276313},
publisher = {ACM Press},
address = {New York, NY, USA},
}

Motivation:
To address aprior's limitations in mining long patterns.
Contribution:
Proposed an effective approach to find Maximal Frequent itemsets without loss of completeness. Performance is one to two orders of magnitude better than apiori-like algorithms.
Methods:
Use item iteration tree to present all itemsets.
Do superset and subset pruning.
Use lower-bound support to save database scanning by using subset support information.
Discussions:
This approach can find maximal frequent itemsets efficiently. However, all frequent itemsets are implicit (though completely contained) in the caculated results. For tasks like association rule mining, there will still be substantial computation.

Xun's Studyroom