I work in Siri Data Collections, where I help collect data for Siri's Machine Learning engineers.
To do this, I took thousands of videos and pictures of App Clip Codes (similar to QR codes) at different angles, lighting conditions, and distances. This generated a large amount of unorganized data.
I built a data pipeline in Python to organize that data into the exact permutations of specifications set by our Machine Learning engineers.
The pipeline verified that every specification was met, generated a report, merge-sorted the files into the correct order, and then uploaded the data to iCloud.
Before the pipeline, after taking all the pictures and videos it took me hours to organize everything and get it all uploaded manually.
The pipeline did the same work in seconds.
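Here is a minimal sketch of the spec-checking part of that pipeline. The spec values, filename convention, and report format below are placeholders, not the real ones:

```python
# Minimal sketch of the spec-checking pipeline (hypothetical spec fields and paths).
from pathlib import Path
import csv
import itertools

# Hypothetical specification values the ML engineers asked for.
ANGLES = ["0deg", "45deg", "90deg"]
LIGHTING = ["indoor", "outdoor", "lowlight"]
DISTANCES = ["near", "mid", "far"]

REQUIRED_COMBOS = set(itertools.product(ANGLES, LIGHTING, DISTANCES))

def parse_specs(path: Path):
    """Assume each filename encodes its specs, e.g. clip_45deg_indoor_near_001.mov."""
    _, angle, lighting, distance, _ = path.stem.split("_")
    return angle, lighting, distance

def run(data_dir: str, report_path: str):
    files = sorted(Path(data_dir).glob("*.mov"))      # deterministic ordering
    seen = {parse_specs(f) for f in files}
    missing = REQUIRED_COMBOS - seen                  # spec combinations not yet covered

    # Write a simple coverage report for the ML engineers.
    with open(report_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["angle", "lighting", "distance", "status"])
        for combo in sorted(REQUIRED_COMBOS):
            writer.writerow([*combo, "missing" if combo in missing else "ok"])
    return files, missing

if __name__ == "__main__":
    files, missing = run("raw_captures", "coverage_report.csv")
    print(f"{len(files)} files checked, {len(missing)} spec combinations missing")
```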
When I initially took over this data processing system, the code scanned data from one node (i.e., a remote machine) at a time.
Each node produced thousands of videos, so parsing every video for a node took many hours.
I rewrote it as a multiprocessing script so that the scans could run in parallel.
Rather than each node's data being scanned one at a time, all of the nodes' data was scanned in parallel.
I also implemented a logging system so we could monitor progress.
The processing time went from about 8 hours to about 2 hours.
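The structure looked roughly like this; the node list and per-node scan logic below are placeholders:

```python
# Sketch of the parallel node scan (node list and scan logic are placeholders).
import logging
from multiprocessing import Pool

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(processName)s %(message)s")

NODES = [f"node-{i:02d}" for i in range(1, 13)]  # hypothetical remote machines

def scan_node(node: str) -> tuple[str, int]:
    """Placeholder for the per-node scan; returns (node, videos_parsed)."""
    logging.info("scanning %s", node)
    parsed = 0
    # ... fetch and parse each video produced by this node ...
    return node, parsed

if __name__ == "__main__":
    # One worker per node, instead of scanning nodes one at a time.
    with Pool(processes=len(NODES)) as pool:
        for node, count in pool.imap_unordered(scan_node, NODES):
            logging.info("finished %s: %d videos", node, count)
```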
Our cluster in China was supposed to segment the generated videos according to a particular specification.
Our new hires over there didn't do this, since they were still learning the system.
This created duplicate IDs that caused incorrect statistics.
SQL's built-in ON DELETE CASCADE and ON UPDATE CASCADE referential actions propagate a delete or update from a parent record down to every child record that references it. That wasn't the functionality we needed.
I implemented a custom cascade algorithm that kept a pointer to the parent record slated for deletion and re-pointed all of its child records to the ID of the record that was replacing it.
It took me over 100 hours to figure out and get right.
This eliminated the duplicates in our relational data model, which joins 7 hierarchical tables.
With the duplicates gone, our statistics were correct again.
Reference: foreign-key-referential-actions
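The core of the idea, sketched with sqlite3 and made-up table names (the real schema joins 7 tables):

```python
# Sketch of the custom cascade: re-point children to the surviving ID, then delete
# the duplicate parent. Table and column names here are illustrative, not the real schema.
import sqlite3

CHILD_TABLES = ["segments", "annotations", "clips"]  # hypothetical child tables

def merge_duplicate(conn: sqlite3.Connection, dup_id: int, keep_id: int) -> None:
    """Replace every reference to dup_id with keep_id, then remove the duplicate."""
    with conn:  # single transaction: either everything moves or nothing does
        for table in CHILD_TABLES:
            conn.execute(
                f"UPDATE {table} SET parent_id = ? WHERE parent_id = ?",
                (keep_id, dup_id),
            )
        conn.execute("DELETE FROM parents WHERE id = ?", (dup_id,))
```

Unlike ON DELETE CASCADE, which would have removed the child rows, this keeps every child row and simply re-points it at the surviving parent.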
Whenever we get new data for users, we have to ETL (extract, transform, and load) hundreds of raw data files.
These ETL operations produce thousands upon thousands of rows that have to be inserted into multiple tables.
They require heavy-duty data cleaning and wrangling.
Each insert must maintain referential integrity according to our relational data model; that is by far the most complex part.
I wrote hundreds of lines of SQL to meet the problem specs, assisted by pandas, Excel, and Python.
The new user data was inserted into our relational database successfully.
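A stripped-down sketch of the load step, using sqlite3 and made-up tables, to show the parent-before-child ordering that keeps the foreign keys valid:

```python
# Sketch of a referential-integrity-preserving load: clean with pandas, then insert
# parent rows before child rows inside one transaction. Table names are illustrative.
import sqlite3
import pandas as pd

def load_users(conn: sqlite3.Connection, raw_csv: str) -> None:
    df = pd.read_csv(raw_csv, dtype=str)                           # read everything as text
    df = df.dropna(subset=["user_id"]).drop_duplicates("user_id")  # basic cleaning

    conn.execute("PRAGMA foreign_keys = ON")   # enforce the relational model
    with conn:
        # Parents first, so child inserts never reference a missing user.
        conn.executemany(
            "INSERT OR IGNORE INTO users (user_id, name) VALUES (?, ?)",
            df[["user_id", "name"]].itertuples(index=False, name=None),
        )
        conn.executemany(
            "INSERT INTO sessions (user_id, session_date) VALUES (?, ?)",
            df[["user_id", "session_date"]].itertuples(index=False, name=None),
        )
```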
We needed to forecast how much data we would have in the future, at our current rate of production, according to a particular set of specifications.
The hard part was producing this forecast across 12 computers spread out across the country in a synchronized way.
I implemented a distributed system in Python with a driver node in our main Cupertino office and worker nodes outside of it.
The driver and worker nodes stayed in sync using Paramiko (a Python SSH library).
Each worker node sent its data to the driver node; the driver node did the main statistical computations and then sent the stats to our database server.
Tableau then queried our database and rendered a data visualization of the results.
The outcome was a successful distributed forecast: leadership could see what our data volume would likely be in the future.
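A sketch of the driver side of that collection loop; the hostnames and the remote command are placeholders:

```python
# Sketch of the driver: pull each worker's counts over SSH with Paramiko, then
# aggregate them for the forecast. Hostnames and the remote command are placeholders.
import json
import paramiko

WORKER_HOSTS = ["worker-01.example.com", "worker-02.example.com"]  # hypothetical

def collect_counts(user: str, key_path: str) -> dict[str, int]:
    totals: dict[str, int] = {}
    for host in WORKER_HOSTS:
        client = paramiko.SSHClient()
        client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        client.connect(host, username=user, key_filename=key_path)
        try:
            # Each worker prints its local counts as JSON.
            _, stdout, _ = client.exec_command("python3 report_counts.py")
            counts = json.loads(stdout.read().decode())
            for spec, n in counts.items():
                totals[spec] = totals.get(spec, 0) + n
        finally:
            client.close()
    return totals
```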
We have a distributed machine learning system on each of the 12 remote nodes streaming data to a centralized location. I wrote scripts to extract, transform, and load this data into a database. A Tableau dashboard then picks up the newly inserted data and creates a data visualization from it.
My former manager, Dr. Gautam Krishna, wrote machine learning algorithms that watch videos to check whether they're invalid. I'm currently operating it and learning how it works so I can hopefully improve it someday.
I wrote a script that scans local files on disk to make sure they match what was uploaded to the database.
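At its core it is a set comparison; here is the idea with assumed paths, table, and column names:

```python
# Sketch of the disk-vs-database consistency check (paths, table, and column names
# are assumptions).
import sqlite3
from pathlib import Path

def find_mismatches(conn: sqlite3.Connection, data_dir: str) -> tuple[set, set]:
    on_disk = {p.name for p in Path(data_dir).rglob("*.mov")}
    in_db = {row[0] for row in conn.execute("SELECT filename FROM uploads")}
    missing_from_db = on_disk - in_db      # captured locally but never uploaded
    missing_from_disk = in_db - on_disk    # uploaded but no longer on disk
    return missing_from_db, missing_from_disk
```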
Four other team members and I built a system to track which metrics we're behind on. I built a data visualization in Tableau for it.
I've built about 40 or 50 data visualizations in Tableau to render various stats.
A prior team member built a system to monitor when the IP address changes on each of the 12 remote nodes; each node streams its new IP to a server in our Cupertino office. I maintain it and add features to it.
A SQL stored procedure that checks whether records in our annotation table need to be updated and, if so, updates the records that need the new data.
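The real thing is a stored procedure; the equivalent update logic, expressed as a plain UPDATE run from Python through SQLAlchemy with made-up table and column names, looks roughly like this:

```python
# The real implementation is a stored procedure; this shows the equivalent logic as a
# plain UPDATE run from Python via SQLAlchemy (table and column names are illustrative).
from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///annotations.db")  # placeholder connection string

with engine.begin() as conn:
    conn.execute(text("""
        UPDATE annotations
        SET label = :new_label, updated_at = CURRENT_TIMESTAMP
        WHERE label <> :new_label          -- only rows that actually need the new data
          AND clip_id IN (SELECT clip_id FROM pending_updates)
    """), {"new_label": "valid"})
```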
A data processing system to assist the Inclusion & Diversity team in collecting data from users of particular demographics.
A data processing system that requests new data from our DB and updates local data files (CSVs) for Tableau workbooks that have a live connection to those files. Our local Tableau setup is essentially a backup to Tableau Server (which can be difficult to work with at times).
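A sketch of that refresh step; the connection string, query, and file names are assumptions:

```python
# Sketch of the CSV refresh: pull the latest rows from the DB and overwrite the
# local files Tableau is live-connected to (query and paths are assumptions).
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///collections.db")   # placeholder connection string

# One query per local extract file that a Tableau workbook points at.
EXTRACTS = {
    "daily_counts.csv": "SELECT capture_date, COUNT(*) AS n FROM uploads GROUP BY capture_date",
}

for filename, query in EXTRACTS.items():
    df = pd.read_sql(query, engine)
    df.to_csv(filename, index=False)   # Tableau's live connection picks up the new file
```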
I do a lot of ad hoc data operations and analytics programming with pandas, NumPy, matplotlib, SQLAlchemy, and Python to run statistical computations or modify data in our DB.
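A typical one-off looks something like this; the table and column are made up for the example:

```python
# Typical ad hoc analysis: pull a table, compute a few stats, save a quick histogram
# (table and column names are made up for the example).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sqlalchemy import create_engine

engine = create_engine("sqlite:///collections.db")    # placeholder connection string

df = pd.read_sql("SELECT duration_sec FROM uploads", engine)
print(df["duration_sec"].describe())                  # quick summary statistics
print("95th percentile:", np.percentile(df["duration_sec"], 95))

df["duration_sec"].hist(bins=30)
plt.xlabel("video duration (s)")
plt.savefig("duration_hist.png")
```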