Friday, December 09, 2005

[r] Freitas01, "A survey of evolutionary algorithms for data mining and knowledge discovery"


@misc{ freitas01survey,
author = "A. Freitas",
title = "A survey of evolutionary algorithms for data mining and knowledge discovery",
text = "Freitas, A.A. (2001). A survey of evolutionary algorithms for data mining
and knowledge discovery. To appear in: Ghosh, A.; Tsutsui, S. (Eds.) Advances
in evolutionary computation. Springer-Verlag.",
year = "2001",
url = "citeseer.ist.psu.edu/freitas01survey.html" }


This paper describes in detail how two variants of evolutionary computation, genetic algorithms (GA) and genetic programming (GP), can be used in the data mining (DM) and knowledge discovery (KDD) process. It is the first time I have read such clear statements generalizing the knowledge discovery process. The paper illustrates the use of GA and GP in the three stages of classification tasks in DM and KDD, namely preprocessing, rule generation and postprocessing. Since gene expression programming (GEP) is an evolved offspring of GA and GP, and since many other DM/KDD tasks can be mapped to classification tasks, it would be enlightening to solve a concrete KDD problem, such as one in web mining, with GEP.

[m] Supercomputing 2005

First, I noticed a change in the Blogspot edit box: arbitrary choice of the publishing date is no longer allowed. Is this a measure to call for blog owners' timely input, or a feature change after some G engineers investigated MSN Spaces? (The M guys currently do allow flexible publish-time settings.) Never mind; I was lazy and am solely responsible for the delayed posts, whether they were caught or not :P.

Supercomputing 2005 was great, partly because of, but definitely much more than, the Bill Gates show. Some people are laughing at Bill, yet we have to ask why supercomputing has become so attractive that Mr. Gates felt obliged to do an on-site promotion of MS Windows 2003 cluster.

In my opinion, during the past decade (or 14 years, counting from the birth of Linux in 1991), the two elements of computer science - computers and data - have undergone drastic changes. Free operating systems like Linux have become open to the public and of comparable quality to commercial ones, greatly relieving the budget concerns of building up computation power. The most recent news I read is the Ultimate Linux Lunchbox. The accessibility of supercomputing to the general public is not a dream but a very real fact now.

The characteristics of data, in volume, dimension and interestingness, are changing rapidly. The first annual KDD conference was held in 1995, still mainly concentrated on data manipulation in databases. Nowadays everybody has access to the world's biggest database - the web - and new "database" companies such as Google and Yahoo are taking the positions once occupied by Oracle and Sybase. Everyone has become a data consumer, more or less, aware of it or not.

Then, as a young computer scientist, how should I adapt to these changes? That is really an interesting topic worth thinking about thoroughly.

Thursday, October 27, 2005

[t] video see-through equipment

This week I mainly spent time investigating video see-through techniques for their possible use in our AR project. Before this work, many people (such as the audience of my talk in the Robotics Lab, even including myself!) were skeptical about how reliable a system using video see-through would be. Well, here are the findings:
  • The University of North Carolina has been experimenting with video AR since 2000. Their prototypes include modified versions of the Proview HMD and the Sony Glasstron, with holders made in-house.
    The choice of the Proview HMD is mainly due to its wide field of view (model SR80: 53° (V) x 63° (H), 80° diagonal) and high resolution (1024x768 full color at 60Hz). The disadvantage of this product is that it is a bit bulky and heavy (1.7 lbs for the head-mounted part), and the price is around $28k.
    The Sony Glasstron version is clearly a PLM700, just like the one we are using, without the eye shades (they are very easy to take off).
    Although I did not find a detailed description of the camera model they used, I am pretty sure it is one of the Panasonic stick cameras commonly used in surveillance systems. Those are color cameras supporting resolutions up to 800x600. The video format is analog, so there must be a frame grabber on the computer side.
  • Another interesting product series I found is the Trivisio HMDs. Trivisio is a German company selling ready-made stereo and monocular video see-through HMDs. The cameras support resolutions as high as 800x600, as does the display. The field of view is 40° diagonal (32° horizontal, 24° vertical). The heaviest model weighs 230g, slightly heavier than the Sony Glasstron we are using (220g); taking camera weight into consideration, it is actually lighter.
    The interface between these HMDs and the computer is USB 2.0, which can transfer video at 480 Mbps, far exceeding our requirements.


    I sent price-quote requests to them and am waiting for a reply. My guess is that the price will be between $15k and $30k.
  • If we are going to build our own, we can take advantage of customized cameras. A renowned vendor in this field is Point Grey Research (www.ptgrey.com), which manufactures a whole line of micro cameras. I have received quotes from them: single-camera prices are around $2-3k. Lenses can be bought from various optical vendors, with prices varying from under $1k to $4k.
  • I was also able to find an alternative for wide-area tracking: the HiBall, basically a golf-ball-sized six-camera optical tracking system with environment-mounted infrared beacons. It can support a tracking area as big as 1600 sq ft while maintaining satisfying accuracy and precision. Details can be found at http://www.3rdtech.com/HiBall.htm. I have sent a quote request to them, but I guess the price will be considerable.


[Edited on 11/15/05]:
On my visit to the HiTLab, I got to know another vendor providing a non-see-through HMD: eMagin. Their product is called the 3D-Visor. It is not a see-through product, but it is stereo, relatively low cost (around $800), can be fully USB powered, and has a built-in inertial sensor for orientation tracking. Thanks to Phillips Lamb for his valuable input.

Monday, August 08, 2005

[t] static functions in C

In C, the 'static' modifier, when used on a function, means the function's scope is limited to the file itself (internal linkage). This concept is not storage-related. However, when 'static' is used on a variable, it is indeed storage-related, in contrast to 'auto' and 'register'.
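A minimal sketch of both uses (my own illustration, not taken from the references below):

/* file: helper.c */

/* 'static' on a function: internal linkage; the name is invisible to
   other translation units. Nothing about storage changes. */
static int clamp(int x) {
    return x < 0 ? 0 : x;
}

/* 'static' on a local variable: static storage duration; the value
   persists across calls (unlike 'auto' variables on the stack). */
int next_id(void) {
    static int id = 0;   /* initialized once, kept between calls */
    return clamp(++id);
}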

A good reference for scopes in C is here.

Another important and easily confused pair of concepts is const-volatile. A reference for the const and volatile qualifiers is here.
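A tiny example combining the two qualifiers (again my own illustration; the register address is made up): a read-only view of a hardware register that may change behind the program's back.

#include <stdint.h>

/* 'volatile': the value may change outside the program's control, so
   every access must actually read the memory.
   'const': this code is not allowed to write through the pointer. */
#define STATUS_REG ((const volatile uint32_t *)0x40001000)

int device_ready(void) {
    return (*STATUS_REG & 0x1u) != 0;   /* re-reads the register each call */
}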

Friday, May 20, 2005

[t] Java Review

I checked out the book "Sams Teach Yourself Java 2 in 21 Days" for a Java review. The progress is fine. As of today I have finished:
  1. Day 8: Putting Interactive Programs on the Web.
  2. Day 10: Adding Images, Animation and Sound.
  3. Day 15: Class Roles: Packages, Interfaces and Other Features.
  4. Day 20: Designing a User Interface with Swing.
  5. Day 21: Handling User Events with Swing.
  6. Day 19: Java Beans and Other Advanced Features.
  7. Day 12: Arranging Components on a User Interface.
  8. Day 16: Exceptional Circumstances: Error Handling and Security.
  9. Day 17: Handling Data Through Java Streams.
  10. Day 18: Communicating Across the Internet.
  11. Day 9: Making Programs Look Good with Graphics, Fonts and Color.
  12. [... further progress will be recorded here]
It is a good feeling to master something :).

Sunday, May 15, 2005

[t] Java Plotting

I tried to use Java for mote signal plotting, and it was successful. The plotting package I used is called the Graph Class Library. Setting up JBuilder took me quite some time, and there are still problems importing the TinyOS packages.
In spite of JBuilder's cumbersomeness, it is definitely a friendly and powerful development environment. The autocompletion and error checking are gorgeous.
No matter what language you use (VC, VB or Java, could be Python :P), the backstage scheme is the same: 1) prepare a buffer to store the plotting data, 2) update the buffer tail to reflect the passage of time, 3) replot.
All three steps leave room for improvement when the signal comes in at a high frequency. For step 1, a ring buffer or dynamic array can be used to avoid moving data in step 2; for step 2, updates can be done in batch mode or on a fixed-interval basis; for step 3, replotting can be restricted to part of the window area, or other window-clipping techniques can be used. A ring-buffer sketch is given below.
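Here is a minimal ring-buffer sketch for steps 1 and 2 (in C, since the scheme is language-independent; all names are mine, not from any particular plotting package):

#include <assert.h>

#define CAPACITY 1024   /* size of the plotting window, illustrative */

typedef struct {
    double data[CAPACITY];
    int head;    /* index of the oldest sample */
    int count;   /* number of valid samples */
} RingBuffer;

/* Append one sample; once full, overwrite the oldest. No data is
   moved when the buffer advances, only indices change. */
void rb_push(RingBuffer *rb, double sample) {
    int tail = (rb->head + rb->count) % CAPACITY;
    rb->data[tail] = sample;
    if (rb->count < CAPACITY)
        rb->count++;
    else
        rb->head = (rb->head + 1) % CAPACITY;   /* drop the oldest */
}

/* Read the i-th sample in time order, for replotting. */
double rb_get(const RingBuffer *rb, int i) {
    assert(i < rb->count);
    return rb->data[(rb->head + i) % CAPACITY];
}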

[edit on 05/16]
The Moteiv people keep warning that a clean Cygwin installation is a must, and with good reason: I found out that the TinyOS Java package path problem was actually caused by my Cygwin installation. After a complete reinstall, the problems were solved.

Monday, May 09, 2005

[t] Usage of MSChart Control

The MSChart control is a very useful ActiveX control that can fulfill virtually all Excel-style plotting needs. Prepare a data grid and assign it to the control's ChartData member, then call the control's Update() function; that's it.

One thing to note is that the first row and the first column of the data grid are reserved for heading data, so the useful cells actually start from grid(1,1).

Tuesday, May 03, 2005

[r] liu99, "Mining Association Rules with Multiple Minimum Supports"

@inproceedings{312274,
author = {Bing Liu and Wynne Hsu and Yiming Ma},
title = {Mining association rules with multiple minimum supports},
booktitle = {KDD '99: Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining},
year = {1999},
isbn = {1-58113-143-7},
pages = {337--341},
location = {San Diego, California, United States},
doi = {http://doi.acm.org/10.1145/312129.312274},
publisher = {ACM Press},
address = {New York, NY, USA},
}
Motivation:
To address the problems in frequent itemset finding caused by treating all items as if they had similar frequencies.
Contributions:
1) Systematically stated why applying a universal minimum support value is problematic.
2) Proposed the MSapriori algorithm to address this problem.
Methods:
Follows Apriori, but the level-2 candidate sets are generated differently, and the pruning is more restricted. (I think the key condition throughout the paper is: if c(1) in s or MIS(c(2)) = MIS(c(1)) ..... ).
Discussions:
MSapriori is a neat algorithm. Note that when applying it to table datasets, the form is a bit different. A sketch of the modified candidate generation is given below.
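For my own reference, the flavor of the level-2 candidate generation as I understand it (a C sketch; items[] must be pre-sorted in ascending MIS order, and sup[], mis[] and emit_candidate are my placeholders, not names from the paper):

#include <stddef.h>

void emit_candidate(int a, int b);   /* placeholder: record the pair {a, b} */

/* items[] are sorted by ascending MIS, so for any j > i, items[i]
   carries the lower MIS of the pair. */
void gen_level2(const int items[], const double sup[],
                const double mis[], size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (sup[i] < mis[i]) continue;       /* seed must pass its own MIS */
        for (size_t j = i + 1; j < n; j++) {
            /* the pair only needs to reach the lowest MIS of its members */
            if (sup[j] >= mis[i])
                emit_candidate(items[i], items[j]);
        }
    }
}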

[r] zhai05, "Web Data Extraction Based on Partial Tree Alignment"

@inproceedings{1060761,
author = {Yanhong Zhai and Bing Liu},
title = {Web data extraction based on partial tree alignment},
booktitle = {WWW '05: Proceedings of the 14th international conference on World Wide Web},
year = {2005},
isbn = {1-59593-046-9},
pages = {76--85},
location = {Chiba, Japan},
doi = {http://doi.acm.org/10.1145/1060745.1060761},
publisher = {ACM Press},
address = {New York, NY, USA},
}
Motivation:
To effectively 1) find data records in a web page and 2) align the data fields across multiple data records.
Contributions:
Built the corresponding system, DEPTA (also known as MDR-2);
Used visual cues to improve the accuracy of the found data regions;
Proposed an algorithm for data field alignment based on partial tree alignment.
Methods:
Uses visual cues obtained from browser rendering, which improves accuracy and robustness;
Partial tree alignment.
Discussion:
This can be regarded as an alignment paper. The idea of using a seed tree as the matching baseline and growing it is simple yet quite neat. This is also an active, ongoing line of research.

Monday, May 02, 2005

[r] zhang96, "Birch: An Efficient Data Clustering Method for Very Large Databases"

@inproceedings{ zhang96birch,
author = "Tian Zhang and Raghu Ramakrishnan and Miron Livny",
title = "{BIRCH}: an efficient data clustering method for very large databases",
pages = "103--114",
year = "1996",
url = "citeseer.ist.psu.edu/zhang96birch.html" }

Motivation:
Clustering in large databases is very expensive, mainly because of excessive database scans. This paper addresses that problem.
Contributions:
Proposed the idea of "local clustering" and proved that introducing it does not affect correctness. The idea is quite like the B-tree in the database field: every cluster calculation and update is done locally.
Method:
Proved that the Clustering Feature (CF) has some interesting properties; introduced the algorithms that operate on CFs.
Discussion:
The idea in this paper is influential. Recently, J. Han's group proposed clustering of moving objects based on "micro-clusters"; their ideas are largely borrowed from BIRCH.
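The property that makes local updates possible is CF additivity. A minimal sketch in C (one common formulation of the CF triple; my own code, not the paper's):

#define D 2   /* dimensionality, illustrative */

/* Clustering Feature: CF = (N, LS, SS), where N is the number of
   points, LS the per-dimension linear sum, and SS the sum of squared
   norms. Centroid and radius are recoverable from the triple alone. */
typedef struct {
    int    n;
    double ls[D];
    double ss;
} CF;

/* Additivity: merging two disjoint clusters is component-wise addition. */
CF cf_merge(CF a, CF b) {
    CF c;
    c.n  = a.n + b.n;
    c.ss = a.ss + b.ss;
    for (int i = 0; i < D; i++)
        c.ls[i] = a.ls[i] + b.ls[i];
    return c;
}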

[r] fagin03, "Searching the Workplace Web"

@misc{ fagin03searching,
author = "R. Fagin and R. Kumar and K. McCurley and J. Novak and D. Sivakumar and
J. Tomlin and D. Williamson",
title = "Searching the workplace web",
text = "Ronald Fagin, Ravi Kumar, Kevin S. McCurley, Jasmine Novak, D. Sivakumar,
John A. Tomlin, and David P. Williamson. Searching the workplace web. In
Proceedings of the Twelfth International World Wide Web Conference, Budapest,
2003.",
year = "2003",
url = "citeseer.ist.psu.edu/fagin03searching.html" }
Motivation:
To clearly characterize the intranet search problem and find proper approaches.
Contributions:
The authors stated a few key differences between intranet search and Internet search, and proposed using an aggregation of search strategies to find the most accurate results.
Methods:
Search-strategy aggregation; details are in the paper, with straightforward algorithms.
Discussion:
The biggest contribution of this paper is pointing out the differences between intranet and Internet search.

[r] han00, "Mining Frequent Patterns Without Candidate Generation"

@inproceedings{ han00mining,
author = "Jiawei Han and Jian Pei and Yiwen Yin",
title = "Mining frequent patterns without candidate generation",
booktitle = "2000 ACM SIGMOD Intl. Conference on Management of Data",
month = "05",
publisher = "ACM Press",
editor = "Weidong Chen and Jeffrey Naughton and Philip A. Bernstein",
isbn = "1-58113-218-2",
pages = "1--12",
year = "2000",
url = "citeseer.ist.psu.edu/han99mining.html" }
Motivation:
To address the combinatorial limitation of Apriori-like algorithms.
Contributions:
Proposed the FP-tree data structure, a compact representation of the database without loss of frequent-pattern information; also proposed the FP-growth algorithm to mine frequent patterns from the FP-tree.
Methods:
The FP-tree makes the database compact by prefix sharing.
FP-growth uses a divide-and-conquer technique, splitting frequent-pattern mining at each stage k into 1-item mining plus a (k-1)-item subproblem.
Discussions:
Although the experimental results in the paper show that FP-tree and FP-growth greatly improve storage and computation efficiency, there is no in-depth complexity analysis in the paper.
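To remind myself how prefix sharing compacts the database, a minimal FP-tree insertion sketch in C (my own illustration; header-table links are omitted, and each transaction must arrive with its items sorted in descending global frequency):

#include <stdlib.h>

typedef struct FPNode {
    int item;                /* item id */
    int count;               /* transactions sharing this prefix */
    struct FPNode *child;    /* first child */
    struct FPNode *sibling;  /* next sibling under the same parent */
} FPNode;

/* Insert one transaction into the tree rooted at 'root'. Shared
   prefixes only bump counts, which is what compacts the database. */
void fp_insert(FPNode *root, const int *items, int n) {
    FPNode *cur = root;
    for (int k = 0; k < n; k++) {
        FPNode *ch = cur->child;
        while (ch && ch->item != items[k])
            ch = ch->sibling;
        if (ch) {
            ch->count++;                     /* prefix already present */
        } else {
            ch = calloc(1, sizeof *ch);      /* start a new branch */
            ch->item = items[k];
            ch->count = 1;
            ch->sibling = cur->child;
            cur->child = ch;
        }
        cur = ch;
    }
}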

[r] agrawal95, "Mining Sequential Patterns"

@inproceedings{ agrawal95mining,
author = "Rakesh Agrawal and Ramakrishnan Srikant",
title = "Mining sequential patterns",
booktitle = "Eleventh International Conference on Data Engineering",
publisher = "IEEE Computer Society Press",
address = "Taipei, Taiwan",
editor = "Philip S. Yu and Arbee S. P. Chen",
pages = "3--14",
year = "1995",
url = "citeseer.ist.psu.edu/agrawal95mining.html" }

Motivation:
To mine sequential patterns from sequences of customer transactions.
Contributions:
Clearly defined the sequential pattern mining problem and the notion of support for it. Gave the whole process for sequence mining, and three algorithms for the sequencing phase.
Methods:
Sequence mining is done in five phases: sort, large itemset, transformation, sequence, and maximal. Details can be found in the paper; pay attention to the large-itemset-finding part, and there is no need to spend much time on the three sequencing algorithms.
Discussions:
This algorithm can find all maximal frequent sequences. However, its Apriori nature keeps its performance low. More recently, J. Han et al. have done research on sequence mining performance.

[r] liu05, "Semi-Supervised Learning"

"Semi-Supervised Learning", Professor Bing Liu's instruction slides, UIC CS583 course, Spring 2005.
Motivation:
To address the classification problem when only positive labeled items and unlabeled items are provided.
Contribution:
1) Argued that unlabeled data is useful for increasing classifier accuracy (this was certified by Nigam et al. in 2000; Prof. Liu extended the idea); 2) theoretically proved that EM is usable in text mining, with the methods Spy-EM, naive Bayes and SVM; 3) proposed a new metric to evaluate classifier performance in situations where the F-score cannot be estimated.
Methods:
1) Spy-EM finds reliable negative items in the M-stage. 2) The metric r^2/Pr(f(x)=1) is used, where r can be estimated from the positive cases in the validation set and Pr(f(x)=1) can be estimated from the whole validation set.
Discussions:
The concept of EM is not hard; the key point is to find a way to do the maximization in the M-stage. The authors' main contribution is to define the maximization problem as finding reliable negative cases under the constraint of keeping the r threshold, and to prove that r^2/Pr(f(x)=1) is a proper metric for classifier performance.
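To convince myself the metric is proper, a short derivation (my own restatement, with p = precision = Pr(Y=1 | f(X)=1) and r = recall = Pr(f(X)=1 | Y=1)):

Pr(f(X)=1, Y=1) = p * Pr(f(X)=1) = r * Pr(Y=1),
so p = r * Pr(Y=1) / Pr(f(X)=1), and therefore
r^2 / Pr(f(X)=1) = (p * r) / Pr(Y=1).

Since Pr(Y=1) is a constant for a given dataset, ranking classifiers by r^2/Pr(f(x)=1) is the same as ranking them by p*r, and both r and Pr(f(x)=1) can be estimated without any negative labels.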

[r] Bayardo98, "Efficiently Mining Long Patterns from Databases"

@inproceedings{276313,
author = {Roberto J. Bayardo, Jr.},
title = {Efficiently mining long patterns from databases},
booktitle = {SIGMOD '98: Proceedings of the 1998 ACM SIGMOD international conference on Management of data},
year = {1998},
isbn = {0-89791-995-5},
pages = {85--93},
location = {Seattle, Washington, United States},
doi = {http://doi.acm.org/10.1145/276304.276313},
publisher = {ACM Press},
address = {New York, NY, USA},
}
Motivation:
To address Apriori's limitations in mining long patterns.
Contribution:
Proposed an effective approach to finding maximal frequent itemsets without loss of completeness. Performance is one to two orders of magnitude better than Apriori-like algorithms.
Methods:
Uses a set-enumeration tree to represent all itemsets.
Does superset and subset pruning.
Uses lower-bounding of support to save database scans, exploiting subset support information.
Discussions:
This approach can find maximal frequent itemsets efficiently. However, all frequent itemsets are only implicit (though completely contained) in the computed results; for tasks like association rule mining, substantial computation is still required.

Friday, April 29, 2005

[t] making movies in Matlab

In Matlab, a series of plots can be recorded as a movie. Use a frame array to store the frames, then play them back either in Matlab or save them to an AVI file.

I made one for the waveforms. However, the AVI file is a bit oversized, due to the lack of an efficient compressor.

Sample code is listed below:
clear; clf;

% Load the recorded signals; each column of max_flx_0 is one channel.
load max_flx_0;
extensor=max_flx_0(:,5);
flexor=max_flx_0(:,4);
torque=max_flx_0(:,3);

% Open an uncompressed AVI file to collect the frames.
moviefile=avifile('coac2.avi','compression','None');
segmentsize=10;                 % samples revealed per frame
for j=1:(2500/segmentsize)
% Redraw each subplot with the data seen so far; plot replaces the
% previous lines, so no 'hold on' is needed (it would pile up objects).
subplot(3,1,1); plot(extensor(1:j*segmentsize),'g');
axis([0 3000 -5 5]);
subplot(3,1,2); plot(flexor(1:j*segmentsize),'g');
axis([0 3000 -5 5]);
subplot(3,1,3); plot(torque(1:j*segmentsize),'g');
axis([0 3000 -2 2]);
% Grab the whole figure as one movie frame.
moviefile=addframe(moviefile, getframe(gcf));
end

moviefile=close(moviefile);

[edited on 05/11/05]
I found a very good AVI compressor: it compressed my 300MB AVI file to a 1.3MB MPG. Cool!

Sunday, April 24, 2005

[t] TMote: working with matlab

There is nice documentation for TinyOS, and how to use Matlab to interact with the TMote is covered in one of the sections. Briefly: 1) several TMotes measure signals and transfer them over the radio link; 2) a base TMote connected to the PC receives the radio signal, unpacks the data, and sends it to the UART (USB); 3) the user can either read raw data directly from the UART (USB) or use SerialForwarder to forward the raw data to a local network port; 4) the Java oscilloscope application can plot the signals (it supports multiple sensors and channels!); 5) Matlab can call Java directly, and with the help of the JavaMatlabControl class, Java can also call Matlab.
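As a rough illustration of step 3, any TCP client can attach to the port SerialForwarder listens on and pull the forwarded bytes. A minimal POSIX C sketch (the port 9001 is my assumption, so check the SerialForwarder window; this just dumps raw bytes and ignores SerialForwarder's own handshake and packet framing, which a real client must implement):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

#define SF_PORT 9001   /* assumption: match your SerialForwarder setting */

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port = htons(SF_PORT);
    addr.sin_addr.s_addr = inet_addr("127.0.0.1");  /* forwarder runs locally */

    if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
        perror("connect");
        return 1;
    }
    unsigned char buf[256];
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0) {   /* dump the raw bytes */
        for (ssize_t i = 0; i < n; i++)
            printf("%02x ", buf[i]);
        printf("\n");
    }
    close(fd);
    return 0;
}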

the second son

Well, another "hello world" for a newborn blog. This one will be used as a working log and reading notebook.

I have long been thinking about recording daily work progress and paper-reading notes, but I never put it into action -- I am just too lazy. A Ph.D. student needs to be big in work and ideas, as well as in writing. Polishing my third paper and enjoying the spring sunshine, I feel it is high time to do something about long-delayed plans.

Here are the naming conventions for this blog:
[R]eading: notes of literature I read.
[T]echnical: tech skills, tips and tricks.
[S]ummary: conference summaries.
[M]iscellaneous: misc stuff.