Finding Split Points


Continuous attributes
- In the parallel environment, each processor holds a separate contiguous section of a "global" attribute list. Each processor's C_below and C_above histograms must therefore be initialized to account for the sections held by the other processors.
- C_below is initialized to reflect the class distribution of all sections of the attribute-list assigned to processors of lower rank.
- C_above is initialized to reflect the class distribution of the local section as well as all sections assigned to processors of higher rank.
- The statistics for initializing C_above and C_below are gathered when the attribute lists for new leaves are created. Once collected, they are exchanged among all processors and stored with each leaf, where they are later used to initialize that leaf's C_above and C_below histograms.
- After processing its attribute-list section, each processor has a locally best split point. The N processors then communicate to determine which of these N split points has the lowest cost.
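The steps above can be sketched in a single process, with a list of sections standing in for the processors. This is a minimal illustration, not the actual implementation: the function names are hypothetical, the gini index is assumed as the split cost, and a candidate split "attribute <= value" is evaluated after every record, which assumes distinct attribute values.

```python
from collections import Counter

def init_histograms(section_counts):
    """Return (C_below, C_above) for each rank from per-section class counts."""
    result = []
    for rank in range(len(section_counts)):
        c_below = Counter()
        for counts in section_counts[:rank]:   # sections on lower ranks
            c_below.update(counts)
        c_above = Counter()
        for counts in section_counts[rank:]:   # local section and higher ranks
            c_above.update(counts)
        result.append((c_below, c_above))
    return result

def gini(counts):
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_local_split(section, c_below, c_above):
    """Scan one sorted section; return (cost, value) of its best split."""
    below, above = Counter(c_below), Counter(c_above)
    best = (float("inf"), None)
    for value, label in section:
        below[label] += 1                      # record moves below the split
        above[label] -= 1
        n_below, n_above = sum(below.values()), sum(above.values())
        if n_above == 0:
            continue                           # everything on one side: not a split
        total = n_below + n_above
        cost = (n_below / total) * gini(below) + (n_above / total) * gini(above)
        if cost < best[0]:
            best = (cost, value)
    return best

# Hypothetical data: a globally sorted attribute list cut into 3 sections.
sections = [
    [(1, "A"), (2, "A")],
    [(3, "A"), (4, "B")],
    [(5, "B"), (6, "B")],
]
section_counts = [Counter(label for _, label in s) for s in sections]
histograms = init_histograms(section_counts)

# Each "processor" finds its local best; the minimum is the global winner.
local_bests = [best_local_split(s, below, above)
               for s, (below, above) in zip(sections, histograms)]
global_best = min(local_bests)
print(global_best)
```

In a real parallel run the two exchanges (section class counts, then local best splits) would be collective communication steps rather than list operations.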
Categorical attributes
- The count matrix built by each processor is based on "local" information only. Hence, the count matrices are exchanged to obtain the "global" counts.
- A coordinator collects the local count matrices and sums them to produce the global matrix.
- The global matrix is used to calculate the best split for each categorical attribute.
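The categorical case can be sketched the same way. In this illustration a count matrix is assumed to be a mapping from (attribute value, class) pairs to counts, the coordinator is a plain function, and the best split is found by exhaustively trying value subsets under a gini cost. The names and the exhaustive search are assumptions made for the example; the search is only practical for attributes with few distinct values.

```python
from collections import Counter
from itertools import combinations

def merge_count_matrices(local_matrices):
    """Coordinator step: sum the local count matrices entry by entry."""
    total = Counter()
    for matrix in local_matrices:
        total.update(matrix)
    return total

def gini(class_counts):
    n = sum(class_counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in class_counts.values())

def best_subset_split(global_matrix):
    """Return (cost, subset) for the cheapest split 'value in subset'."""
    values = sorted({value for value, _ in global_matrix})
    total = sum(global_matrix.values())
    best = (float("inf"), None)
    for size in range(1, len(values)):         # proper, non-empty subsets only
        for subset in combinations(values, size):
            left, right = Counter(), Counter()
            for (value, label), count in global_matrix.items():
                side = left if value in subset else right
                side[label] += count
            n_left = sum(left.values())
            cost = ((n_left / total) * gini(left)
                    + ((total - n_left) / total) * gini(right))
            if cost < best[0]:
                best = (cost, frozenset(subset))
    return best

# Hypothetical local count matrices from two processors.
local_matrices = [
    Counter({("red", "A"): 2, ("blue", "B"): 1}),
    Counter({("red", "A"): 1, ("blue", "B"): 2, ("green", "B"): 1}),
]
global_matrix = merge_count_matrices(local_matrices)
best = best_subset_split(global_matrix)
print(best)
```

Each subset and its complement describe the same partition, so the loop does redundant work; that is accepted here for brevity.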


DBMS
1999-03-11