4.6.4. Install Hadoop Replication

Installation of the Hadoop replication consists of multiple stages:

  1. Configure the source and target hosts following the prerequisites outlined in Appendix B, Prerequisites, then follow the appropriate steps for the required extractor topology outlined in Chapter 3, Deploying MySQL Extractors.

  2. Install the Applier replicator, which applies information to the target Hadoop environment.

  3. Once the installation of the Extractor and Applier components has been completed, materialization of tables and views can be performed.

4.6.4.1. Applier Replicator Service

The applier replicator service reads information from the THL of the source and applies this to a local instance of Hadoop.

Important

Installation must take place on a node within the Hadoop cluster. Writing to a remote HDFS filesystem is not currently supported.
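
Because the applier writes to the local Hadoop installation, it can help to confirm that the Hadoop client tools are usable on the intended host before going further. The following is a minimal sketch and not part of the Tungsten distribution: require_commands is a hypothetical helper name, and a hadoop client on the PATH is only assumed to indicate a usable cluster node.

```shell
#!/bin/sh
# Hypothetical helper: return 0 if every named command is on the PATH,
# otherwise report each missing command on stderr and return 1.
require_commands() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example, to be run on the intended applier host:
#   require_commands hadoop && hadoop fs -ls /
```

If the commented example lists the HDFS root without errors, the host can at least reach the filesystem the applier will write to.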

  1. Before installing the applier, the extractor configuration needs the following addition. Apply the parameter below, update the extractor, and then install the applier.

    • For Staging installs:

      shell> cd tungsten-replicator-6.0.5-40
      shell> ./tools/tpm configure alpha \
        --enable-batch-service=true
      shell> ./tools/tpm update
    • For INI installs: Add the following to /etc/tungsten/tungsten.ini:

      
      [alpha]
      ...Existing Replicator Config...
      enable-batch-service=true
      
      
      shell> tpm update
  2. The applier can now be configured.

    Unpack the Tungsten Replicator distribution in the staging directory:

    shell> tar zxf tungsten-replicator-6.0.5-40.tar.gz
  3. Change into the staging directory:

    shell> cd tungsten-replicator-6.0.5-40
  4. Configure the installation using tpm:

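    For a Staging install, run the tpm configure commands below. For an INI install, edit /etc/tungsten/tungsten.ini as shown in the listing that follows them.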

    shell> ./tools/tpm configure defaults \
        --reset \
        --user=tungsten \
        --install-directory=/opt/continuent \
        --profile-script=~/.bash_profile \
        --skip-validation-check=HostsFileCheck \
        --skip-validation-check=InstallerMasterSlaveCheck \
        --skip-validation-check=DatasourceDBPort \
        --skip-validation-check=DirectDatasourceDBPort \
        --skip-validation-check=ReplicationServicePipelines
    
    shell> ./tools/tpm configure alpha \
        --master=host1 \
        --members=host2 \
        --property=replicator.datasource.global.csvType=hive \
        --property=replicator.stage.q-to-dbms.blockCommitInterval=1s \
        --property=replicator.stage.q-to-dbms.blockCommitRowCount=1000 \
        --replication-password=secret \
        --replication-user=tungsten \
        --batch-enabled=true \
        --batch-load-language=js \
        --batch-load-template=hadoop \
        --datasource-type=file
    
    shell> vi /etc/tungsten/tungsten.ini
    [defaults]
    user=tungsten
    install-directory=/opt/continuent
    profile-script=~/.bash_profile
    skip-validation-check=HostsFileCheck
    skip-validation-check=InstallerMasterSlaveCheck
    skip-validation-check=DatasourceDBPort
    skip-validation-check=DirectDatasourceDBPort
    skip-validation-check=ReplicationServicePipelines
    
    [alpha]
    master=host1
    members=host2
    property=replicator.datasource.global.csvType=hive
    property=replicator.stage.q-to-dbms.blockCommitInterval=1s
    property=replicator.stage.q-to-dbms.blockCommitRowCount=1000
    replication-password=secret
    replication-user=tungsten
    batch-enabled=true
    batch-load-language=js
    batch-load-template=hadoop
    datasource-type=file
    

  5. Once the prerequisites and the configuration of the installation have been completed, the software can be installed:

    shell> ./tools/tpm install
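
For INI installs, the applier settings from step 4 can be sanity-checked before running the install step above. This is a minimal sketch under stated assumptions: ini_get and check_applier_ini are hypothetical helper names, not tpm features, and the expected values are simply those shown in the [alpha] listing.

```shell
#!/bin/sh
# Hypothetical helper: print the first value of KEY inside [SECTION] of an INI file.
# $1 = file, $2 = section name, $3 = key
ini_get() {
    awk -v section="[$2]" -v key="$3" '
        { gsub(/^[ \t]+|[ \t]+$/, "") }               # trim surrounding whitespace
        /^\[/ { in_section = ($0 == section); next }  # track the current section
        in_section {
            eq = index($0, "=")
            if (eq > 0 && substr($0, 1, eq - 1) == key) {
                print substr($0, eq + 1)
                exit
            }
        }
    ' "$1"
}

# Hypothetical helper: verify the Hadoop batch settings shown in step 4.
# $1 = file, $2 = service name; returns 1 if any setting differs.
check_applier_ini() {
    file=$1
    service=$2
    status=0
    for pair in batch-enabled=true batch-load-template=hadoop datasource-type=file; do
        key=${pair%%=*}
        want=${pair#*=}
        got=$(ini_get "$file" "$service" "$key")
        if [ "$got" != "$want" ]; then
            echo "$key: expected '$want', found '$got'" >&2
            status=1
        fi
    done
    return "$status"
}

# Example:
#   check_applier_ini /etc/tungsten/tungsten.ini alpha && echo "applier settings look complete"
```

The check only compares literal values, so it does not replace the validation that tpm itself performs.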

If the installation process fails, check the /tmp/tungsten-configure.log file for more information about the root cause.
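
A quick way to narrow down the log is to filter it for lines that mention a problem. The sketch below is illustrative only: show_install_errors is a hypothetical helper name, and the search terms are a guess at useful keywords rather than a documented log format.

```shell
#!/bin/sh
# Hypothetical helper: print the last 20 numbered lines of a tpm log that
# mention an error, failure, or exception (case-insensitive).
# $1 = log file (defaults to the path named above)
show_install_errors() {
    log=${1:-/tmp/tungsten-configure.log}
    if [ ! -r "$log" ]; then
        echo "cannot read $log" >&2
        return 2
    fi
    grep -inE 'error|fail|exception' "$log" | tail -n 20
}

# Example:
#   show_install_errors
```

The line numbers in the output make it easier to open the full log at the relevant point.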

Once the service has been installed it can be monitored using the trepctl command. See Section 4.6.4.4, “Management and Monitoring of Hadoop Deployments” for more information. If there are problems during installation, see Section 4.6.4.5, “Troubleshooting Hadoop Replication”.
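
For scripted monitoring, the state field can be extracted from the trepctl status output. The following is a minimal sketch: replicator_state is a hypothetical helper name, and it assumes the status output contains a line of the form "state : VALUE".

```shell
#!/bin/sh
# Hypothetical helper: read `trepctl status` output on stdin and print the
# value of the "state" field (for example ONLINE or OFFLINE:ERROR).
replicator_state() {
    awk '/^state[ \t]*:/ {
        sub(/^state[ \t]*:[ \t]*/, "")   # drop the field name and separator
        sub(/[ \t]+$/, "")               # drop trailing whitespace
        print
        exit
    }'
}

# Example, using the service name from the configuration above:
#   trepctl -service alpha status | replicator_state
```

A wrapper script could compare the printed value against ONLINE and raise an alert otherwise.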

4.6.4.3. Accessing Generated Tables in Hive

If you have not already done so, follow the schema generation process described in Section 4.6.2.2, “Schema Generation”. This creates the necessary Hive schema and staging schema definitions.

Once the tables have been created through ddlscan, you can query the stage tables:

hive> select * from stage_xxx_movies_large limit 10;
OK
I	10	1	57475	All in the Family	1971	Archie Feels Left Out (#4.17)
I	10	2	57476	All in the Family	1971	Archie Finds a Friend (#6.18)
I	10	3	57477	All in the Family	1971	Archie Gets the Business: Part 1 (#8.1)
I	10	4	57478	All in the Family	1971	Archie Gets the Business: Part 2 (#8.2)
I	10	5	57479	All in the Family	1971	Archie Gives Blood (#1.4)
I	10	6	57480	All in the Family	1971	Archie Goes Too Far (#3.17)
I	10	7	57481	All in the Family	1971	Archie in the Cellar (#4.10)
I	10	8	57482	All in the Family	1971	Archie in the Hospital (#3.15)
I	10	9	57483	All in the Family	1971	Archie in the Lock-Up (#2.3)
I	10	10	57484	All in the Family	1971	Archie Is Branded (#3.20)
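
The same preview can be run from the shell instead of the interactive hive prompt. This is a sketch under stated assumptions: stage_preview_query is a hypothetical helper name, and the stage_xxx_ prefix is copied from the example above, so it may differ depending on how the staging schema was generated.

```shell
#!/bin/sh
# Hypothetical helper: build a preview query for a staging table.
# $1 = source table name, $2 = row limit (defaults to 10)
stage_preview_query() {
    printf 'select * from stage_xxx_%s limit %d;\n' "$1" "${2:-10}"
}

# Example: run the query non-interactively with the Hive CLI.
#   hive -e "$(stage_preview_query movies_large)"
```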