
UBXPython: The New Scripting Engine for UBX Big Data Analytics
UBXPython is a powerful scripting engine that enables Python users to leverage UBX for custom analytics and reporting without specialized domain knowledge. It integrates NumPy for enhanced data manipulation and offers support for multiple map-reduce steps. Explore how UBXPython simplifies data processing in the UBX system.
UBXPython THE NEW SCRIPTING ENGINE FOR UBX
UBXPython: Key Facts
- UBXPython is the new scripting engine for the UBX big data and analytics system.
- It allows end users who are familiar with the Python programming language to be real power users of UBX.
- Users can now do custom prepayment analytics, reports, and slice & dice operations without needing specialized domain knowledge of the UBX system or having to learn proprietary programming interfaces.
- UBXPython scripts are pure Python files. Users can access the rich set of exposed UBX interfaces by importing the UBX module.
- Users can use the entire set of Python language features (flow control, built-in data types) and interface with UBX module methods to build custom analytics and slice & dice apps.
- UBXPython also integrates NumPy natively. UBX internal matrix data and slice & dice query output data can be represented as NumPy arrays, allowing users to harness NumPy's data manipulation and array math tools to further slice & dice UBX data.
- In addition to ad-hoc analysis, pre-defined analysis and reporting can also be performed through UBXPython.
High-Level Architecture
1. The power user writes a UBX Python script and sends it to UBX via the web-based interface.
2. The UBXPython engine on the Governor and on each of the Processing Nodes executes this script in parallel.
3. The UBXPython engine is C++ based and runs an embedded Python interpreter. The embedded interpreter executes the Python code as calls into, and back out from, UBX C++ code.
4. The final results are combined by the UBXPython engine on the Governor and sent back to the user.
UBXPython: New in Version 3
- Support for multiple map-reduce steps
  - The UBX system operates as a highly optimized map-reduce system, with compute Nodes and aggregating reducers.
  - Typically this is exposed to the script power user via the ParallelOn() and ParallelOff() commands.
  - All code between the ParallelOn() and ParallelOff() lines is executed on the compute Nodes; code outside these lines is executed on the aggregator node (SysGovernor).
  - All variables from the Nodes are aggregated on the SysGovernor as soon as a ParallelOff() is encountered.
  - Typically a script has one ParallelOn()/ParallelOff() block. Multiple parallel blocks are now supported, enabling intermediate aggregation of variables before continuing processing on the compute Nodes.
- Support for explicit aggregation
  - In addition to the implicit aggregation of all variables at a ParallelOff(), targeted, explicit aggregation is now available via the new SendAggregate(varName) command, which sends just that variable's value to the SysGovernor to be aggregated.
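The UBXPython engine itself is proprietary, but the node/Governor flow described above can be sketched in plain Python. This is a minimal, hypothetical model (the function names map_on_nodes and aggregate are illustrative, not UBX APIs): each "parallel block" maps a function over per-node data partitions, the Governor aggregates the partial results, and a second block continues on the nodes using the aggregated value.

```python
# Hypothetical pure-Python sketch of the multi-step map-reduce flow.
# map_on_nodes() stands in for the code between ParallelOn()/ParallelOff();
# aggregate() plays the role of the Governor-side (SysGovernor) aggregation.

def map_on_nodes(partitions, fn):
    """Run fn on each node's data partition (the 'parallel' section)."""
    return [fn(p) for p in partitions]

def aggregate(partials):
    """Governor-side aggregation, as at ParallelOff() or SendAggregate(var)."""
    return sum(partials)

# First parallel block: per-node partial sums
partitions = [[1, 2, 3], [4, 5], [6]]
partials = map_on_nodes(partitions, sum)   # [6, 9, 6]
total = aggregate(partials)                # 21, aggregated on the Governor

# Second parallel block: nodes continue with the aggregated value
scaled = map_on_nodes(partitions, lambda p: [x / total for x in p])
```

The point of the second block is exactly what the slide calls "intermediate aggregation": the nodes could not compute the normalized values until the Governor had combined their partial sums.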
UBXPython: New in Version 3 (cont.)
Support for customized aggregation of standard Python containers:
- Python dictionaries
  - The user can define a Python dict with keys/values; the values can have dissimilar types (int, string, list).
  - Dicts are aggregated at a ParallelOff() by default, and also work with the SendAggregate command.
- Python lists
  - As with dicts, lists can contain values of dissimilar types.
  - Values are aggregated by default at a ParallelOff(), and similarly with the SendAggregate command.
  - A new command applies to lists: SendAppend(varName). The values from each node are appended to the list and sent back to the Nodes to continue processing.
- Python sets
  - Standard Python sets are supported and can have members of dissimilar types.
  - Sets work with ParallelOff(), SendAggregate, and SendAppend.
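The slide does not spell out the exact merge rules the engine applies, so the following is a hedged sketch of one plausible behavior, in plain Python: per-node dicts merged key-by-key (ints summed, lists concatenated), a SendAppend-style combination of per-node lists, and a union of per-node sets with dissimilar member types. The helper names aggregate_dicts and send_append are hypothetical.

```python
# Hypothetical sketch of Governor-side container aggregation.
# The real aggregation rules live inside the UBXPython engine.

def aggregate_dicts(node_dicts):
    """Merge per-node dicts: ints summed, lists concatenated, first value kept otherwise."""
    out = {}
    for d in node_dicts:
        for k, v in d.items():
            if k not in out:
                out[k] = v
            elif isinstance(v, int):
                out[k] += v
            elif isinstance(v, list):
                out[k] = out[k] + v
    return out

def send_append(node_lists):
    """SendAppend-style behavior: values from each node appended into one list."""
    combined = []
    for lst in node_lists:
        combined.extend(lst)
    return combined

# Dict values of dissimilar types (int, list, string), as the slide allows
node_dicts = [{"count": 2, "ids": [1]},
              {"count": 3, "ids": [7, 9], "label": "A"}]
merged = aggregate_dicts(node_dicts)  # {'count': 5, 'ids': [1, 7, 9], 'label': 'A'}

# Sets with members of dissimilar types, unioned across nodes
node_sets = [{1, "x"}, {2, "x"}]
union = set().union(*node_sets)       # {1, 2, 'x'}
```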
Typical UBXPython Program Structure

# import the entire UBX module
# and also import any other packages needed, e.g. NumPy
from UBXMod import *
import numpy as np

# define a function
def aFunc(filename):
    print("inside aFunc()")
    # body of the function, e.g. load a file
    aFileObj = ubPy.Load(filename)
    return aFileObj

# start main body of the script
# include any Python language elements: variable declarations, manipulations,
# including lists, dicts, etc.

# parallel section: define and manipulate UBX objects
ubPy.ParallelOn()
aUBFileObj = aFunc("someInputFile")
# manipulate the UBX object, e.g. run queries, assign to variables,
# do NumPy math with returned objects, etc.
ubPy.ParallelOff()

# regular post-parallel aggregation statements, executed on the Governor
A Simple Example
Load and merge two files, iterate through each record of the merge, and output/print conditionally.

# import the entire UBX module
# and also import any other packages needed, e.g. NumPy
from UBXMod import *
import numpy as np

# start main body
ubPy.ParallelOn()

# load the two input data files
fileObj1 = ubPy.Load("fh_arm_io_pdb_201804")
fileObj2 = ubPy.Load("fh_arm_gen_ati_ext1")

# define the merge params: inputs, fields to merge on, type of merge, etc.
inputList = [fileObj2, fileObj1]
lookupList = [fileObj1, fileObj2]
byFieldList = ['poolno', ['poolno', 'firstpi']]
dictMergeParams = {'inputs': inputList, 'byFields': byFieldList, 'mode': 'outer'}

# do the actual merge and create the merge object
theMergeObj = ubPy.Merge(dictMergeParams)

# define an output object, typically an output file with the specified table
# name, to be written from the merge object
dictOutputParams = {"table": "fh_arm_io_pdb_ext5_test"}
oStatus = ubPy.Output(theMergeObj, dictOutputParams)

# define some print objects to print out columns from the merge object;
# pass in a list of columns
thePrintObj1 = ubPy.TextArrayWrapper(theMergeObj, ['poolno', 'firstpi', 'aols_ati'])
thePrintObj2 = ubPy.TextArrayWrapper(theMergeObj, ['poolno', 'dt1_ati'])

# iterate through each record of the merge object, and conditionally write to
# the output object and print via the objects above
while ubPy.NextRecord(theMergeObj):
    if theMergeObj is not None and theMergeObj.Rcd_N1_ > 0:
        ubPy.Put(theMergeObj)      # writes the record to the output file
        ubPy.Print(thePrintObj1)
    else:
        ubPy.Print(thePrintObj2)

ubPy.ParallelOff()

# regular post-parallel aggregation statements, executed on the Governor
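The ubPy.Merge call above runs inside the UBX engine, so it cannot be executed outside the system. As a runnable illustration of what 'mode': 'outer' means, here is a minimal sketch of a full outer merge on a shared key, using plain Python dicts in place of UBX file objects (outer_merge and the sample records are hypothetical, not part of the UBX API).

```python
# Hypothetical sketch of an outer merge keyed on 'poolno', mirroring the
# dictMergeParams example above with plain Python data.

def outer_merge(left, right, key):
    """Full outer merge of two lists of record dicts on a shared key field."""
    keys = {r[key] for r in left} | {r[key] for r in right}
    left_by = {r[key]: r for r in left}
    right_by = {r[key]: r for r in right}
    merged = []
    for k in sorted(keys, key=str):
        row = {key: k}
        row.update(left_by.get(k, {}))   # fields from the left file, if present
        row.update(right_by.get(k, {}))  # fields from the right file, if present
        merged.append(row)
    return merged

file1 = [{"poolno": "A1", "firstpi": 0.5}]
file2 = [{"poolno": "A1", "aols_ati": 3.0}, {"poolno": "B2", "aols_ati": 1.0}]
rows = outer_merge(file1, file2, "poolno")
# rows[0] -> {'poolno': 'A1', 'firstpi': 0.5, 'aols_ati': 3.0}
```

Because the merge is outer, the pool "B2", which appears in only one input, still produces a (partial) output record, which is what makes the conditional Put/Print logic in the script above useful.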
Another Simple Example
Load a file, then create a NumPy array out of certain chosen columns from all records in the file.

# import the entire UBX module
# and also import any other packages needed, e.g. NumPy
from UBXMod import *
import numpy as np

# this time, log to a separate log file
logFile = open("UBXPythonScript.log", 'w')
logFile.write("Hello, World from simple_example_2.py!\n")

# start main body
ubPy.ParallelOn()

# load the input data
fileObj1 = ubPy.Load("fh_arm_io_pdb_201804")

# to create a NumPy array we first create a NumpyWrapperObj from the specified
# input object (here fileObj1) and the specified columns (fields)
listFlds = ['poolno', 'firstpi']
numpyWrapperObj = ubPy.NumpyArrayWrapper(fileObj1, listFlds)

# create the NumPy array from the wrapper object
testNumpyArray = ubPy.ToNumpyArray(numpyWrapperObj)

# testNumpyArray is now a 1-D NumPy array with a structured dtype, which is
# NumPy-speak for an aggregated data type; here it covers the chosen fields
# poolno and firstpi, i.e. a fixed-size string and a float value.
# See the following log lines:
logFile.write("testNumpyArray dtype=" + str(testNumpyArray.dtype) + "\n")
# would print: testNumpyArray dtype=[('poolno', 'S10'), ('firstpi', '<f8')]
logFile.write("testNumpyArray shape=" + str(testNumpyArray.shape) + "\n")
# would print the number of records

# feel free to do any NumPy array manipulation, slicing/dicing with this array

ubPy.ParallelOff()

# regular post-parallel aggregation statements, executed on the Governor
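Everything after ubPy.ToNumpyArray is ordinary NumPy, so that part can be demonstrated standalone. This runnable example builds a structured array with the same field names and dtypes the script logs (the record values themselves are made up) and shows the by-name field access and array math the slide alludes to.

```python
import numpy as np

# Structured dtype matching the logged output: an S10 string and a float64
dt = np.dtype([("poolno", "S10"), ("firstpi", "f8")])
arr = np.array([(b"AB1234", 4.5), (b"CD5678", 3.25)], dtype=dt)

# 1-D structured array: shape is (num_records,), fields accessed by name
poolnos = arr["poolno"]          # array of S10 byte strings
mean_pi = arr["firstpi"].mean()  # slice-and-dice with ordinary NumPy math
```

From here the usual NumPy toolkit applies: boolean masks like arr[arr["firstpi"] > 4.0], sorting with np.sort(arr, order="firstpi"), and so on.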
UBXPython vs. the Existing Python API
Very different, and not really competing, interfaces!

UBXPython:
- Is a proper Python scripting interface: it allows the user to write Python code to manipulate, slice & dice, and create custom reports and analyses.
- Enables creating custom applications; the output is completely custom, limited only by the power user's requirements and creativity.
- Includes native NumPy support to harness NumPy's data analysis and array manipulation capabilities with UBX data.
- Is completely extensible: the user can import other packages (say, pandas or TensorFlow) and create new apps from the NumPy data generated from UBX.

Existing Python API:
- Is really just a canned Python file acting as a delivery mechanism for the cohort string/file that the user has to create; the output is always the output of the specified cohort.
- Is a quick and easy way to specify a cohort and deliver it to UBX without using the web interface.
References
- UBXPython Arch and Programming Reference document